

NAVAL  POSTGRADUATE  SCHOOL 

Monterey,  California 


THESIS 


THE  INSTRUMENTATION  OF  A  PARALLEL,  DISTRIBUTED 
DATABASE  OPERATION,  RETRIEVE-COMMON,  FOR 
MERGING  TWO  LARGE  SETS  OF  RECORDS 


by 

Gregory  Alan  Hammond 
June,  1992 

Thesis  Advisor:  David  K.  Hsiao 


Approved  for  public  release;  distribution  is  unlimited 


92-06727 




Unclassified
SECURITY CLASSIFICATION OF THIS PAGE


REPORT DOCUMENTATION PAGE

1a. REPORT SECURITY CLASSIFICATION: Unclassified
1b. RESTRICTIVE MARKINGS:
2a. SECURITY CLASSIFICATION AUTHORITY:
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE:
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution is unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S):
5. MONITORING ORGANIZATION REPORT NUMBER(S):

6a. NAME OF PERFORMING ORGANIZATION: Computer Technology Curriculum, Naval Postgraduate School
6b. OFFICE SYMBOL (If Applicable): 37
6c. ADDRESS (city, state, and ZIP code): Monterey, CA 93943-5000
7a. NAME OF MONITORING ORGANIZATION: Naval Postgraduate School
7b. ADDRESS (city, state, and ZIP code): Monterey, CA 93943-5000
8a. NAME OF FUNDING/SPONSORING ORGANIZATION:
8b. OFFICE SYMBOL (If Applicable):
8c. ADDRESS (city, state, and ZIP code):
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER:
10. SOURCE OF FUNDING NUMBERS: PROGRAM ELEMENT NO. | PROJECT NO. | TASK NO. | WORK UNIT ACCESSION NO.

11. TITLE (Include Security Classification): THE INSTRUMENTATION OF A PARALLEL, DISTRIBUTED DATABASE OPERATION, RETRIEVE-COMMON, FOR MERGING TWO LARGE SETS OF RECORDS
12. PERSONAL AUTHOR(S): HAMMOND, GREGORY ALAN
13a. TYPE OF REPORT: Master's Thesis
13b. TIME COVERED: FROM Jan 1991 TO Feb 1992
14. DATE OF REPORT (year, month, day): JUNE 1992
15. PAGE COUNT:
16. SUPPLEMENTARY NOTATION: The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
17. COSATI CODES: FIELD | GROUP | SUBGROUP
18. SUBJECT TERMS (continue on reverse if necessary and identify by block number): Distributed Database, Parallel Database, Record Merging, Multi-backend


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

The Naval Postgraduate School's Laboratory for Database Systems Research is the site of the multi-backend database
supercomputer (MBDS). Originally, MBDS supported a prototype primary operation (retrieve-common) which merged two
sets of records in a distributed, parallel database environment. This thesis presents the testing and modification of that
prototyped primary operation.

First, the design rationale of the MBDS is reviewed. Specifically, this review examines the reasons for a database-oriented
supercomputer, the MBDS primary processes, and the methodology of distributing a database within loosely coupled
and highly parallel database stores. Then, this study explains the methodology involved in developing theories on the cause
of retrieve-common's defects and bottlenecks. Finally, in validating our theories, this study relates the process of discovering
and correcting these discrepancies.


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: [X] UNCLASSIFIED/UNLIMITED  [ ] SAME AS RPT.  [ ] DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL: DAVID K. HSIAO
22b. TELEPHONE (Include Area Code): (408) 646-2253
22c. OFFICE SYMBOL: CS/Hq

DD FORM 1473, 84 MAR    83 APR edition may be used until exhausted    SECURITY CLASSIFICATION OF THIS PAGE




Approved  for  public  release;  distribution  is  unlimited. 


The  Instrumentation  Of  A  Parallel,  Distributed  Database  Operation, 
Retrieve-Common,  for  Merging  Two  Large  Sets  Of  Records 

by 


Gregory  Alan  Hammond 
Lieutenant,  United  States  Navy 
B.S.,  California  State  University,  1983 


Submitted  in  partial  fulfillment  of  the  requirements 
for  the  degree  of 

MASTER OF SCIENCE IN COMPUTER SCIENCE

from the

NAVAL POSTGRADUATE SCHOOL

June 1992


Author:
Gregory A. Hammond


Approved by:
David K. Hsiao, Thesis Advisor
Professor of Computer Science

Thomas C. Wu, Second Reader
Associate Professor of Computer Science

Robert B. McGhee, Chairman, Department of Computer Science


ABSTRACT 


The  Naval  Postgraduate  School's  Laboratory  for  Database  Systems  Research 
is  the  site  of  the  multi-backend  database  supercomputer  (MBDS).  Originally, 
MBDS  supported  a  prototype  primary  operation  (retrieve-common)  which 
merged two sets of records in a distributed, parallel database environment. This
thesis presents the testing and modification of that prototyped primary
operation.

First,  the  design  rationale  of  the  MBDS  is  reviewed.  Specifically,  this 
review  examines  the  reasons  for  a  database-oriented  supercomputer,  the  MBDS 
primary  processes,  and  the  methodology  of  distributing  a  database  within  loosely 
coupled  and  highly  parallel  database  stores.  Then,  this  study  explains  the 
methodology  involved  in  developing  theories  on  the  cause  of  retrieve-common’s 
defects  and  bottlenecks.  Finally,  in  validating  our  theories,  this  study  relates  the 
process of discovering and correcting these discrepancies.


I. AN INTRODUCTION TO A SUPERCOMPUTER-DATABASE MACHINE


The increasing desire to access and manipulate greater amounts of complex
information has led researchers to search for methods of improving the
performance of the Database Management System (DBMS). An area that shows
increasing promise is a DBMS that can perform operations in parallel.

A.  SUPERCOMPUTERS  FOR  NUMERICAL  COMPUTATIONS 

The use of parallel operations in a conventional supercomputer for speeding
up computations is not new. There are many production-level, numerically
oriented supercomputers. However, these types of supercomputers are not
effective with operations that involve database structures. Lazou [Ref. 1]
concurred with our observation by stating that conventional supercomputers are
designed for maximizing speeds in calculating floating-point numbers. To fulfill
the requirement of fast computations, these types of supercomputers have been
specifically designed with a multiplicity of scalar or vector functional units and
CPUs. They are designed to receive operands and deliver results under parallel
conditions. The capabilities of these scalar or vector functional units are limited,
since they are restricted to numerical operations only. This limitation to
numerical operations means that database operations will not be able to take
advantage of the parallel processing capability of the conventional
supercomputer.

In addition to the limited capabilities of the functional units, the CPUs are
not effective for database operations either. Very few database problems have
the characteristics needed to take advantage of the multiple CPUs of a numerical
supercomputer. Specifically, a conventional supercomputer's CPUs require a
computational problem to be sectioned into small and parallel portions. Standard
database operations (e.g., retrieve and update) cannot be divided into small and
parallel portions for numerical processing, since database operations are mostly
non-numerical.

B.  SUPERCOMPUTERS  FOR  DATABASE  MANAGEMENT 

The supercomputer designed to provide parallel operations for a DBMS can
be found in the Multiple Backend Database Supercomputer (MBDS). As a
prototype system, the MBDS is developed to provide the necessary architecture
for performance gains and capacity growth via parallel database operations.
Performance gains for the same transaction are obtained by increasing the degree
of parallelism in database management. Capacity growth may be facilitated for
the same response time, if the degree of parallelism is proportional to the
database growth.

MBDS  utilizes  dedicated  computers  (called  database  backends)  configured 
from  multiple,  identical,  and  off-the-shelf  microcomputers,  each  of  which  has  its 
own  external  storage  devices.  The  architecture  of  MBDS  is  illustrated  in 
Figure  1. 

The  architecture  illustrated  in  Figure  1  is  scalable  because  it  introduces 
parallel  backends  and  their  stores  in  proportion  to  the  performance  gains  and 
capacity  growth  desired.  More  precisely,  this  architecture  allows  system 
processes  to  be  replicated  onto  new  and  additional  backend  computers.  These 
replications  allow  parallel  processing  of  database  transactions  and  parallel 
accesses  to  the  database. 


Figure 1. The Multibackend Database Supercomputer


These  parallelisms  of  MBDS  have  been  shown  to  improve  the  performance 
of  DBMS  substantially  and  proportionally. 

C. THE PROCESSES OF THE MULTIBACKEND DATABASE
SUPERCOMPUTER SYSTEM

MBDS software functions (i.e., processes) are discussed in two major
subsections: the controller subsection and the backend subsection.

1. Controller Processes

The controller computer supports five main processes which direct the
operation of the controller computer. These processes are known as Request or
Transaction Processing (TP), Post Processing (PP), Insert-Information-Generator
(IIG), Put, and Get. TP interfaces with the user, identifies the user,
and pre-processes each transaction. Specifically, each transaction is parsed,
checked for syntax errors, and formatted. Upon completion of this
pre-processing, TP broadcasts the transaction to all of the backends, which in turn
store the incoming transaction in their respective transaction queues. PP also
interfaces with the user. It provides transaction results to the user.

To  ensure  that  each  transaction  is  returned  to  the  correct  user,  PP 
maintains  the  ability  to  interact  with  TP  to  match  transaction  responses  to 
appropriate  users.  Additionally,  PP  performs  aggregate  functions  on  data 
returned  from  the  backends.  For  example,  summations  and  averaging  are 
conducted  on  the  data  that  have  been  provided  to  PP. 
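
As an illustration of the aggregate step, the following minimal sketch assumes that each backend returns a partial (sum, count) pair to PP. This message format and the name `combine_average` are hypothetical; they are not taken from the MBDS implementation. The sketch shows why an average over partitioned data is rebuilt from sums and counts rather than by averaging the per-backend averages.

```python
from typing import Iterable, Tuple


def combine_average(partials: Iterable[Tuple[float, int]]) -> float:
    """Combine per-backend (sum, count) pairs into one global average."""
    total = 0.0
    count = 0
    for part_sum, part_count in partials:
        total += part_sum
        count += part_count
    if count == 0:
        raise ValueError("no records were returned by any backend")
    return total / count


# Hypothetical partial results from three backends (the third found nothing).
partials = [(300.0, 3), (50.0, 1), (0.0, 0)]
print(combine_average(partials))  # 87.5
```

Averaging the two non-empty per-backend averages (100.0 and 50.0) would give 75.0, which is not the average of the four underlying records.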

Get and Put provide the controller with the capability to communicate
via the Ethernet with the processes residing on the backend computers.
Specifically, Get allows the receipt of information from the backends. When
communicating with the backends, Put allows the transmission of information in
the one-to-one or one-to-many (i.e., broadcasting) mode.

Finally, IIG is considered a critical process of the controller. This
process is responsible for the even placement of record clusters into the database
stores of the backends. The concept and importance of the record cluster will be
elaborated on in a later section. Here, we consider it simply as a record set. IIG
first determines the backend into which a record is to be inserted. This
determination is completed by using the space utilization table, which maintains
the disk-track information of all the backends' base-data disks. When an
appropriate track is determined, IIG directs the loading of records into the track
of a backend. Following the insertion, IIG directs the updating of the tables in
the meta-data disks as required. IIG's space utilization table provides the
following information:

a. It identifies the backends that contain the first and last trackful of records
of a particular cluster.

b. It identifies the backends that can provide new tracks for new records of a
cluster.

2.  Backend  Processes 

In a backend computer, there are five processes that direct all the
backend operations. These processes are Directory Management (DM), Record
Processing (RP), Concurrency Control (CC), Get, and Put.

DM is responsible for managing and accessing meta data, i.e., data that
contains information about the base data. For example, a descriptor holds the
value range of a particular attribute in the base data. Upon the receipt of a query
of a transaction from TP, DM in each backend takes the keywords of the query and
searches the meta data store for the matching descriptors. When the appropriate
descriptors are located, it determines the clusters (if any) to which the records
belong. This information is then transmitted to RP.

RP  is  responsible  for  the  access  and  manipulation  of  records. 
Specifically,  RP  performs  record  retrieval,  selection  (based  on  additional 
attribute-value  pairs  of  the  query),  and  the  extraction  of  attribute  values. 
Therefore,  it  is  intricately  involved  with  the  disk  input/output  operations. 

CC  is  responsible  for  maintaining  meta-data  and  base-data  integrity  during  the 
execution  of  user  requests  or  transactions.  Because  the  data  requirements  of  user 
requests  may  overlap,  it  is  important  that  the  data  consistency  is  maintained  while 
requests  are  being  processed.  There  is  no  CC  function  in  the  controller  because 
all  of  the  user  requests  are  fulfilled  by  the  backends.  Here,  Get  and  Put  provide 
the  same  communication  capabilities  as  Get  and  Put  of  the  controller.  Figure  2 
illustrates  the  relationship  of  the  controller  processes  and  the  backend  processes. 

D.  THE  CLUSTERS  OF  THE  MBDS  DATABASE 

The replication of DBMS functions onto independent and parallel backends
is the first step in providing parallel operations for a multiuser DBMS. The
second step is related to the accessibility of the database stores. In a conventional
DBMS, accesses are always made to a common database store. This mode of
access is adverse to parallel operations. The adverse effect of accessing a
common database store is directly related to the system's requirement to
maintain data consistency.

Figure 2. The Organization of MBDS Processes

In a multiuser DBMS, the stored data items are the primary resources that
may be accessed concurrently by user transactions. These user transactions
retrieve and modify data that is present in that database store. They can be
executed concurrently and may access and update the same database. If this
concurrent execution is not controlled, it may lead to an inconsistent database,
i.e., a database with incorrect information [Ref. 2]. A technique to control
concurrent executions of transactions is based on the locking concept. Elmasri
[Ref. 2] defines a lock as a variable associated with a data item in the database.
This variable describes the status of that data item with respect to possible
operations that can be applied to it. Essentially, read locks allow transactions
that do not modify the data to share access with other transactions that are
reading only. However, transactions that are involved with writing can only
have access to data if no read or write locks exist over the data. A write
lock does not allow any other transaction to have any access to the data. In
general, the locking mechanism ensures that the integrity of the database store is
maintained by controlling accesses to the store.
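
The read-lock and write-lock rules described above can be sketched as follows. This is a minimal, non-blocking illustration under our own assumptions: the class `ItemLock` and its method names are hypothetical and are not drawn from [Ref. 2] or from the MBDS Concurrency Control process. A refused request simply returns False instead of waiting.

```python
from dataclasses import dataclass


@dataclass
class ItemLock:
    """Lock state of a single data item (hypothetical structure)."""
    readers: int = 0      # number of transactions holding a read lock
    writer: bool = False  # True while one transaction holds the write lock

    def acquire_read(self) -> bool:
        # A read lock is granted unless a writer holds the item.
        if self.writer:
            return False
        self.readers += 1
        return True

    def acquire_write(self) -> bool:
        # A write lock is granted only when no lock of any kind exists.
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def release_read(self) -> None:
        if self.readers == 0:
            raise RuntimeError("no read lock to release")
        self.readers -= 1

    def release_write(self) -> None:
        if not self.writer:
            raise RuntimeError("no write lock to release")
        self.writer = False


lock = ItemLock()
print(lock.acquire_read())   # True: first reader
print(lock.acquire_read())   # True: readers share the item
print(lock.acquire_write())  # False: readers are still present
lock.release_read()
lock.release_read()
print(lock.acquire_write())  # True: no locks remain
print(lock.acquire_read())   # False: the writer excludes readers
```

The refused requests in this sketch correspond to the delays that a backend would experience when locked out of a common store.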

Locking is just one of the many concurrency control methods; however, it
highlights the adverse characteristic of using a common database store. If MBDS
were to utilize a common database store, the backends would experience delays
due to being locked out of information in the common store necessary to
complete a transaction. Therefore, performance gains by using multiple and
parallel computers would be nullified. The solution to this obstacle is to develop
a method that would evenly distribute (partition) the contents of "the common
database" to the multiple database stores, one for each backend.

1. The Partitioning of the Database

A partition of a set A consists of the subdivision of A into a collection
of subsets which are pair-wise disjoint and whose union is A. The use of
partitions ensures that each backend performs its operations on a unique subset of
the database on its own database store. Therefore, the parallelism may be
maintained without performance degradation, since there is no contention over a
single common store. Instead, all the operations are performed on their
database partitions in parallel.

The technique used to partition the records is based on the notion of an
equivalence relation. The idea behind an equivalence relation is that it is a
classification of objects which are in some way "alike." The formal definition of
an equivalence relation [Ref. 3] is as follows: A relation on a set is an equivalence
relation if it is reflexive, symmetric, and transitive on elements of the set.

The properties of reflexivity, symmetry, and transitivity are presented below
for the set F, where the relation is represented by the symbol &.

a. The relation & is reflexive if, for each a that is a member of F, the
following is true: a & a.

b. The relation & is symmetric if, for each a and b that are members of
F, the following is true: a & b implies b & a.

c. The relation & is transitive if, for each a, b, and c that are members of
F, the following is true: a & b and b & c implies a & c.

An abstract example presenting cases in which a relation does not fulfill the
equivalence-relation requirements (transitive, reflexive and symmetric) is
presented below:

Consider the relation TT = {(1,1), (1,2), (2,1), (2,3)} on the set A = {1,2,3}.

a. Both 1 and 2 are members of A; however, (2,2) is not a member of the
relation TT, although (1,1) is in TT. Therefore TT is not
reflexive.

Since a relation must be symmetric, transitive and reflexive to be an equivalence
relation, TT is not an equivalence relation.
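
The three properties can also be checked mechanically. The sketch below is our own illustration and is not part of [Ref. 3]; the helper names are hypothetical. It represents a relation as a set of ordered pairs and tests the relation TT on the set A from the example above.

```python
from itertools import product


def is_reflexive(rel, universe):
    """Every element of the universe must be related to itself."""
    return all((a, a) in rel for a in universe)


def is_symmetric(rel):
    """Whenever (a, b) is in the relation, (b, a) must be as well."""
    return all((b, a) in rel for (a, b) in rel)


def is_transitive(rel):
    """Whenever (a, b) and (b, d) are in the relation, (a, d) must be."""
    return all((a, d) in rel
               for (a, b), (c, d) in product(rel, rel) if b == c)


A = {1, 2, 3}
TT = {(1, 1), (1, 2), (2, 1), (2, 3)}

print(is_reflexive(TT, A))  # False: (2, 2) and (3, 3) are missing
print(is_symmetric(TT))     # False: (2, 3) is present but (3, 2) is not
print(is_transitive(TT))    # False: (1, 2) and (2, 3) are present but (1, 3) is not
```

For this particular relation all three checks fail, so any one of them is sufficient to show that TT is not an equivalence relation.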

The notion of equivalence relations is used because it allows us to
broaden the notion of equality from identity. Elements are judged on similarity
based on being alike relative to a common property. As stated in [Ref. 3], "two
elements need not be identical to be equivalent; they need only to share a
specified property." This sharing of a specific property allows us to explain the
interrelationship of equivalence relations, equivalence classes, and partitions.

The formal definition of an equivalence class [Ref. 3] is as follows: "Let
~ be an equivalence relation on a set A. For each a that is a member of A, the
equivalence class of a is the subset, denoted by [a], consisting of all elements x
of A that are equivalent to a, i.e., x ~ a." This definition allows us to review a
theorem provided in [Ref. 3] which presents the basic properties among elements
of an equivalence relation. Specifically, the theorem assumes that ~ is an
equivalence relation on a set A and that elements x, y are members of A; the
following rules then apply to ~:

a. If x ~ y is true, then [x] = [y].

b. If not (x ~ y) is true, then the intersection of [x] and [y] is empty.

c. The union of all the equivalence classes of ~ is A.

The interrelationship of partitions and equivalence relations becomes
evident when we invoke the aspect of equivalence classes. The rules of
equivalence classes indicate that for any equivalence relation ~ on a set A, the
set of distinct equivalence classes of A modulo ~ constitutes a partitioning of A.
This stipulates that for every equivalence relation on a set A, there exists a
corresponding partition of A in terms of those equivalence classes [Ref. 3].
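
A small sketch can make the link between equivalence classes and partitions concrete. It assumes, for illustration only, that two values are equivalent when a key function maps them to the same result (here, the same range of 10,000); the function names and the sample values are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(items, key):
    """Group items so that x ~ y exactly when key(x) == key(y)."""
    classes = defaultdict(set)
    for item in items:
        classes[key(item)].add(item)
    return list(classes.values())


def is_partition(classes, universe):
    """Check pairwise disjointness and that the union is the whole set."""
    seen = set()
    for cls in classes:
        if not cls or seen & cls:
            return False
        seen |= cls
    return seen == set(universe)


# Hypothetical attribute values; two values are equivalent when they
# fall into the same range of 10,000.
populations = [1200, 8500, 15000, 19999, 20000, 42000]
classes = equivalence_classes(populations, key=lambda p: p // 10000)
print(sorted(sorted(c) for c in classes))
# [[1200, 8500], [15000, 19999], [20000], [42000]]
print(is_partition(classes, populations))  # True
```

Because the relation "has the same key" is reflexive, symmetric, and transitive, the resulting classes are disjoint and together cover every value, which is the partition property the theorem states.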

2.  The  Distribution  of  a  MBDS  Database 

The determination that (1) equivalence classes develop database
partitions and that (2) the union of these database partitions provides the whole
database is the foundation of our database distribution methodology. The
distribution methodology develops similarities by using common attributes and
the attribute-value ranges of the records within the database. These attributes
and ranges are used to develop an equivalence relation and its corresponding
equivalence classes. The equivalence classes develop mutually exclusive
partitions (called clusters in MBDS). These clusters allow the even distribution
of a database onto the backends' stores of MBDS.

The clusters are distributed onto the backends based on a
one-track-per-backend-store algorithm. The records of a cluster are inserted onto a
backend's database store (disks) until the track is full. When it cannot receive any
more data, another backend's database store is selected to receive the next track of
the clustered data. For example, if a track on the database store of backend
number three is full, then the database store of backend number four will be
selected to receive the next track of clustered data. The algorithm, which is
embedded in the IIG process, determines the next database store of a backend
modulo the number of backends. Figure 3 illustrates the distribution of the
records to the database stores, i.e., external storage devices.
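
The track-by-track placement described above can be sketched as follows. This is an illustration under simplifying assumptions, not the IIG code: backends are numbered from zero, every track holds the same number of records, and the space utilization table is ignored. The function name and its parameters are hypothetical.

```python
def distribute_cluster(records, num_backends, track_capacity, first_backend=0):
    """Place one cluster's records track by track across the backends.

    Returns a list with one entry per backend; each entry is the list of
    tracks (lists of records) that the backend received.
    """
    if num_backends < 1 or track_capacity < 1:
        raise ValueError("num_backends and track_capacity must be positive")
    stores = [[] for _ in range(num_backends)]
    backend = first_backend % num_backends
    for start in range(0, len(records), track_capacity):
        track = records[start:start + track_capacity]
        stores[backend].append(track)
        # The next track goes to the next backend, modulo the backend count.
        backend = (backend + 1) % num_backends
    return stores


stores = distribute_cluster(list(range(10)), num_backends=3, track_capacity=2)
for number, tracks in enumerate(stores):
    print(number, tracks)
# 0 [[0, 1], [6, 7]]
# 1 [[2, 3], [8, 9]]
# 2 [[4, 5]]
```

With this placement, no backend holds more than one track more than any other backend for a given cluster, which is the even distribution that the parallel operations rely on.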


Figure 3. MBDS Distribution Strategy


The development of the method to evenly distribute clustered records into
the database stores allows the extensive and scalable architecture of Figure 1 to be
effective. The MBDS allows every backend to process the same transaction
simultaneously. Each backend only needs to know the base data contained in its
database store. This architecture is the foundation of the MBDS parallel
processing capability, which incurs no delays and no lockouts in parallel accesses
to the commonly clustered database.

E.  THE  MBDS  PRIMARY  OPERATIONS 

There are five primary database operations in MBDS. They are Retrieve,
Delete, Update, Insert and Retrieve-Common. The primary operations
Retrieve, Update and Delete operate on a set of records at a time, while Insert
operates on a single record at a time. The retrieve-common primary operation
is different from the other primary operations. It manipulates two sets of records.
This manipulation of two sets of records leads to the uniqueness of the primary
operation. Each of these sets of records is determined by an independent query.
These distinct sets of records are then merged on the basis of a common set of
attribute values specified by the user. In Figure 4, we present a sample
retrieve-common transaction for illustration.

This sample retrieve-common will merge census records with common
names of U.S. cities and Canadian towns. The output would be the names of the
city or town and their respective population figures.


RETRIEVE(FILE=UScensus) (CITY,POPULATION)      The first query is for the source file.

COMMON(CITY,TOWN)                              The common attribute values that would
                                               be used to merge the two files.

RETRIEVE(FILE=Canadacensus) (POPULATION)       The second query is for the target file.


Figure 4. A Sample Retrieve-Common Transaction.


1.  The  Comparison  of  The  Retrieve-Common  and  Equi-join 

The retrieve-common primary operation is equivalent to the relational
equi-join operation. However, differences do exist. Specifically, an equi-join
manipulates two relations in a single DBMS with only one computer
[Ref. 2]. When the appropriate tuples of these relations are collected, they are
merged into a new relation. This new relation is then provided to the user as the
result of the user's query. A retrieve-common, however, is designed to operate
in a parallel DBMS on an incrementable number of backend computers.
Specifically, while conducting such an operation, clustered records on each
backend are searched for records whose attribute-value pairs fulfill the
user query. When that search is completed, however, the backends cannot
consider the user's query to be satisfied merely by merging the appropriate
records on common attribute-value pairs. As highlighted in our discussion of the
database-store distribution, each backend only contains a partition (subset) of the
database. Therefore, to ensure that an adequate merge of attribute-value pairs
does occur, the retrieve-common allows backends to share their individual
partitioned data. This provision is accomplished by the transmission of one's
partitioned data to other backends. Provisions of the equivalence classes ensure
that the sharing of partitioned data (i.e., clustered data) in this manner maintains
the integrity of the database partition (or cluster). All appropriate attribute-value
pairs will be reviewed before a final result is provided to the user. The
reliance on the notion of the equivalence classes, the subdivision of the database
into partitions, and retrieval and sharing of partitioned data from individual
backends is an intricate element in the design of the retrieve-common. Without
these capabilities, MBDS will not be able to conduct parallel merges.

Due  to  its  operational  complexity  and  parallel  nature,  the  retrieve- 
common's  coordination,  communication  and  query  processing  requirements 
exceed  the  requirements  of  an  equi-join. 

2.  The  Retrieve-Common  Algorithm 

The algorithm is provided in the single-query-multiple-data-stream
mode as follows:

a. The controller will broadcast the retrieve-common transaction to all the
backends to be inserted into their respective transaction queues.

b. For that transaction, each backend will retrieve its first set of clustered
records (called source records) that fulfill the first query of that transaction.

c. For each record retrieved, each backend will hash the record into its
virtual memory based on the common attribute value of the record. This
process continues until all of the retrieved records are hashed into
its virtual memory.

d. Each backend will now retrieve the second set of clustered records (called
target records) that fulfill the second query of the transaction.

e. For each of these target records retrieved, the common attribute value is
hashed to provide a virtual memory address. At that point, the records at
that virtual memory address are fetched one by one and compared against
this record. If they match, then they are merged and prepared for
output (see step h). This process continues until all records of the second
set have been retrieved, compared, and processed.


f. Each backend then broadcasts its second set of clustered records to all the
other backends.

g. For each record received via broadcasting, each of the backends will
repeat step e. The process of broadcasting target records to the other
backends will continue until a flag indicating completion is received.

h. Finally, each backend will merge its source records (which met the first
query) with the target records (which met the second query) and output
the results to the controller.
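
Steps b through h can be simulated in a single process with the sketch below. It is a simplified illustration and not the MBDS implementation: a Python dictionary stands in for the hashed virtual memory, the broadcast is modelled as a loop over all backends, and the function name, record layout, and population figures are hypothetical.

```python
from collections import defaultdict


def retrieve_common(backends, source_attr, target_attr):
    """Simulate steps b through h on a list of backends.

    Each backend is a dict with 'source' and 'target' lists of records
    (dicts) that already satisfy the first and second query.
    Returns, per backend, the list of (source, target) pairs it merged.
    """
    # Steps b-c: every backend hashes its own source records.
    tables = []
    for backend in backends:
        table = defaultdict(list)
        for record in backend["source"]:
            table[record[source_attr]].append(record)
        tables.append(table)

    # Steps d-g: every target record is compared on its home backend and,
    # after the broadcast, on every other backend as well.
    results = [[] for _ in backends]
    for sender in backends:
        for target in sender["target"]:
            for number, table in enumerate(tables):
                for source in table.get(target[target_attr], []):
                    # Step h: keep the matching pair for output.
                    results[number].append((source, target))
    return results


# Hypothetical census records spread over two backends.
backends = [
    {"source": [{"CITY": "Paris", "POPULATION": 25000}],
     "target": [{"TOWN": "London", "POPULATION": 400000}]},
    {"source": [{"CITY": "London", "POPULATION": 8000}],
     "target": [{"TOWN": "Paris", "POPULATION": 12000}]},
]
results = retrieve_common(backends, "CITY", "TOWN")
for number, pairs in enumerate(results):
    for source, target in pairs:
        print(number, source["CITY"], source["POPULATION"], target["POPULATION"])
# 0 Paris 25000 12000
# 1 London 8000 400000
```

In this example neither backend can produce a match from its own partition alone; both matches appear only because the target records are shared, which is the reason for steps f and g.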

F.  THE  AIM  AND  INTENT  OF  THE  THESIS 

The preceding introduction of the architecture and design rationale of MBDS
allows us to state the aim and scope of this thesis. Presently, the implementation
of retrieve-common is defective. It only allows the manipulation of a small
database. When the database reaches a size that is appropriate for reasonable
database operations, MBDS fails. Before the completion of this thesis, the cause
of this failure was unknown.

The  aim  of  this  thesis  is  to  develop  a  theory  to  explain  the  cause  of  the 
defective  retrieve-common  operation  and  to  correct  the  defect.  The  thesis  will 
determine  whether  the  defective  operation  is  the  result  of  architectural 
deficiencies,  inadequate  hardware  support,  a  defective  algorithm,  or  erroneous 
implementation.  When  such  deficiencies  are  identified,  this  thesis  will  present 
the  appropriate  correction.  The  final  intent  of  this  thesis  is  to  provide  a 
methodology  to  troubleshoot  (debug)  very  large  parallel  systems.  The  increasing 
importance  of  conducting  parallel  operations  accentuates  the  necessity  of  having 
an  effective  methodology  for  debugging  parallel  operations. 
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G.  THE  ORGANIZATION  OF  THE  THESIS 


The remaining parts of the thesis are organized as follows:

Chapter  II  evaluates  whether  or  not  architectural  deficiencies  exist  in  the  present 
implementation  of  the  retrieve-common.  The  results  of  that  evaluation  can 
direct  the  development  of  theories  regarding  the  cause(s)  of  the  defective 
retrieve-common  operation.  Chapter  III  discusses  the  documentation  which  has 
been developed to appropriately evaluate (i.e., debug) a complex parallel-backend,
multiprocess-based system such as MBDS. Additionally, Chapter III
determines  which  of  the  defect  theories  have  merit  and  presents  corrections  that 
have  been  implemented  to  resolve  those  defects.  Chapter  IV  presents  our 
findings,  and  provides  directions  towards  future  research. 

II. THE DEVELOPMENT OF THEORIES OF DEFECTS


A. A STUDY OF HARDWARE LIMITATIONS AND SOFTWARE
ALGORITHMS

Early  research  indicates  that  three  methods  were  proposed  for  implementing 
the retrieve-common in MBDS [Ref. 4]. The primary consideration behind each
of  these  methods  involves  the  location  for  the  merging  of  two  sets  of  retrieved 
data.  The  methods  are  reviewed  briefly  here: 

Method  1.  The  controller  does  the  entire  merge  operation. 

Method  2.  The  controller  and  the  backends  share  the  workload  of  the 
merge. 

Method  3.  The  backends  do  the  entire  merge  operation. 

The  first  and  second  methods  were  discounted  because  they  violated  the 
major  design  goal  of  MBDS:  to  minimize  the  work  and  involvement  of  the 
controller. The designer believes that minimizing the controller's involvement
(a) allows greater levels of parallel operation by the backends and (b) makes it
less likely that the controller will cause a bottleneck. Since more activities can be
completed in parallel in the individual backends, there is no need to do them
serially in the controller. Additionally, allowing the controller to complete the
merge operation creates the possibility of a bottleneck at the controller.
This bottleneck can arise in two ways: through the transmissions from the
various backends, and through the interactions with the frontend computer.

Thus, the first two methods were eliminated. Method three is the basis for
the design and implementation of retrieve-common presented in Chapter I;
it does not have the limitations of methods 1 and 2 articulated above.

The defective performance of retrieve-common generates doubts about the
merit of the backend-based method three. Theoretically, the system architecture
in Figure 1 is sufficient for completing the backend-based merge operation
[Ref. 5]. However, the system's inability to manipulate large amounts of data
from database stores in retrieve-common justifies a review of the system
hardware performance under the aforementioned methods. We hypothesize
that the hardware limitations of the backends could reduce the
performance of the backend-based merge operation, i.e., method three. On the
other hand, the controller bottleneck discussed earlier for the controller-based
merges may have smaller ramifications than anticipated. We also consider the
possibility that the hardware used to implement the primary operation may
include restrictions on parallel processing. These restrictions may favor the
controller-based implementation of retrieve-common, since the controller is a
serial processor.

The hypothesis that hardware limitations may invalidate the merit of the
backend-based merge, i.e., method three, has been found to be untrue. The
hardware characteristics of the MBDS system [Ref. 6] do not impose
performance restrictions on method three. Based on kernel program results, we
observe that the backend-based merge outperforms the controller-based merge
by about 60 percent. Additionally, we observe that the present algorithm is
implemented according to the designer's specifications.

Our determination that the backend-based retrieve-common algorithm is not
affected negatively by the present hardware elements of MBDS allows us then to
review the software implementation.

B.  TOWARDS  THEORIES  OF  DEBUGGING 

Since  the  retrieve-common  algorithm  utilizes  a  number  of  system  processes, 
a  thorough  understanding  of  the  individual  processes  as  well  as  their 
interrelationships  is  necessary.  The  interrelationship  of  the  major  processes 
ensures  that  any  modification  to  one  will  affect  the  other  system  processes 
accordingly. Modifications are not restricted, but a thorough understanding of
the processes and their interrelationship is required prior to any attempt to
determine and correct implementation errors. Without this understanding, we
may fail to determine the deficiency and make the corrections.

1.  Conducting  Test  Runs 

The first step is to develop a theory regarding the deficiency of
retrieve-common and the interrelationship of system processes by conducting test
runs of the MBDS system. The test runs indicate that the MBDS system operates
for all five primary operations. Moreover, the retrieve-common performs
incorrectly only beyond certain amounts of retrieved data from the database
stores. An initial hypothesis is derived from these tests. We conclude that
the basic logic, i.e., the algorithm of the primary operation, must be correct. If
the basic logic were incorrect, the tests would not operate correctly under any
condition. We then infer that the problem with retrieve-common must be related
to the defective implementation of some data structures or functions for the
algorithm. However, these data structures and functions are shared among
several system processes. Any change will affect the interrelationship of the
system processes. Additionally, some primary operations use other primary
operations for their own operation. For example, the retrieve-common uses the
primary operation, Retrieve, twice to obtain the first and second sets of records,
i.e., source and target files, from the database stores. These records are then
manipulated by retrieve-common in order to provide the correct result.

2.  Placing  Debugging  Flags 

The  complexity  of  process  interrelationships  in  MBDS  requires  us  to 
narrow  our  focus  on  the  problem  area  quickly.  This  is  achieved  by  using 
compilable  debugging  flags  to  determine  which  processes  have  been  involved  in 
the  primary  operation,  retrieve-common.  These  flags  provide  information 
regarding  the  variables  passed,  and  messages  sent  by  these  involved  processes. 

The  use  of  these  compilable  flags  is  also  instrumental  in  determining  the 
sequence  in  which  various  processes  and  primary  operations  are  used  to  complete 
their  assigned  tasks.  Once  the  debugging  flags  have  been  compiled  in  place,  a 
retrieve-common  test  run  is  initiated  with  a  database  size  that  is  known  to  allow 
the  operation  to  complete  correctly.  This  test  run  allows  us  to  identify  all  the 
functions,  processes,  and  programs  involved. 
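
The role of a debugging flag can be sketched as follows. This is a minimal illustration, not MBDS code: the names DEBUG, TRACE, traced, and fetch_records are hypothetical, and a run-time switch is used here in place of a flag that is compiled into the source. When the flag is set, every entry to and exit from a flagged function is recorded together with the process it belongs to, which yields the sequence of functions involved in one test run.

```python
import functools

DEBUG = True   # hypothetical stand-in for a compiled-in debugging flag
TRACE = []     # ordered record of function entries and exits


def traced(process):
    """Mark a function as belonging to `process` and trace its calls."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if DEBUG:
                TRACE.append(("enter", process, func.__name__))
            try:
                return func(*args, **kwargs)
            finally:
                # The exit is recorded even if the function raises.
                if DEBUG:
                    TRACE.append(("exit", process, func.__name__))
        return wrapper
    return decorate


@traced("RP")
def fetch_records(query):
    # Placeholder body; a real record-processing routine would go here.
    return [f"record for {query}"]
```

Reading TRACE after a successful run gives the list of functions and processes that took part in the operation.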

3.  Identifying  File  Locations 

The  flags  are  not  capable  of  indicating  the  locations  of  the  files  in  which 
these functions, processes, and programs are stored. Since there are over
100 such files for MBDS, this limitation must be overcome.

The search mechanism in the operating system is ineffective, because
the MBDS file structure is organized in several layers of abstraction. These
layers of abstraction require that a search request be issued at a specific
layer in order to obtain the correct result. We observe that documentation tools
are needed to allow the determination of file and function information more
efficiently. These documentation tools are described in a later chapter.

4. Determining the Threshold of Failure

The  next  step  is  to  initiate  the  retrieve-common  with  a  database  large 
enough  to  cause  the  primary  operation  to  fail.  Since  this  database  size  is  not 
known,  numerous  operational  tests  are  required.  The  operation  fails  when  it  is 
operated  on  a  database  of  45  records  with  an  average  size  of  32  bytes  per  record. 

Before  the  system  fails,  it  provides  a  trace  of  processes  and  functions 
that  have  been  entered  and  exited  via  debugging  flags. 
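
The search for the failing database size can be organized as a bisection rather than as a series of unguided trials. The sketch below makes two assumptions that the text does not state: that a test run can be wrapped in a function (here called succeeds, a hypothetical name) that reports whether the operation completed, and that once the operation fails at some size it also fails at every larger size.

```python
def find_failure_threshold(succeeds, low, high):
    """Return the smallest size in (low, high] at which `succeeds` is False.

    Assumes succeeds(low) is True, succeeds(high) is False, and that
    failure is monotonic in the database size.
    """
    if not succeeds(low):
        raise ValueError("the operation already fails at the lower bound")
    if succeeds(high):
        raise ValueError("the operation does not fail at the upper bound")
    while high - low > 1:
        mid = (low + high) // 2
        if succeeds(mid):
            low = mid    # the threshold lies above mid
        else:
            high = mid   # the threshold is at or below mid
    return high
```

With a stand-in predicate that fails from 45 records onward, the figure reported above, the bisection needs about ten test runs to isolate the threshold between 1 and 1000 records.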

5. Using Error Feedbacks

Whenever there is an abnormal shutdown of MBDS, a pool of error
indicators is presented in the error-feedback system of MBDS. The error-feedback
system provides an outlet for error indicators and messages from the
operating system and MBDS. It consists of six permanent files. Each is assigned
to a process of the MBDS. When MBDS is running, these files allow for the
insertion of debugging data, error indicators, and diagnostic messages. A
number of such data, indicators and messages are discussed herein. The first
type of error message in the feedback system is usually a message-header
error. The message-header error indicates that somewhere in the system a
message is sent with a defective message-header. The defective message-header
has caused the message to be undeliverable and caused the operating system to
suspend the message-sending processes. Once the running process is suspended,
the operating system generates the error message, which is placed in the
appropriate file for the process. This type of error message is termed illegal
ioctrl. After reviewing it, we determine that this type of error is sufficient to
cause the MBDS system to experience an abnormal system shutdown.

Another type of error indicator is also caused by the defective retrieve-common.
This indicator suggests that system malfunctions have occurred outside
of the system. One indicator, bus error, for example, may be due to too many
processes being concurrently executed by the operating system. Although the
Berkeley 4.3 Unix Operating System has the ability to conduct concurrent
processing [Ref. 6], there is a limit on the number of processes the operating
system can manipulate concurrently. The bus error can imply that this limit
has been reached and that the operating system needs to notify the user. The
operating system then suspends all running processes, places the error message in
the appropriate file, and directs the abnormal shutdown of the MBDS system.

Consider a third type of error caused by the defective retrieve-common,
the write error. This error message usually indicates that the system
has attempted to write to an external storage device that is full or not available.
For writing, the operating system provides an interface between the disk and the
user as shown in the five steps below [Ref. 6]:

a. The operating system allocates a buffer to accept the data provided by the
user or user process.

b. The operating system determines a location on the external storage device
to place the information as indicated by the user or user process.

c. The operating system requests the controller of the external storage device
to read the contents of the physical block into the system buffer.

d. The operating system copies the contents in the input/output buffer of the
user or user process to the appropriate portion of the system buffer.

e. Finally, it writes the system-buffer block back to the external storage
device.

The  write  error  indicates  that  there  is  an  error  in  one  of  the  preceding  steps. 
As  with  other  errors  already  mentioned,  this  error  will  cause  the  operating 
system  to  terminate  the  system  processes  of  MBDS. 
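
The five write steps can be mimicked with a toy block device to show where a write error may surface. This is a sketch only: the block size of 512 bytes, the representation of the disk as a list of byte strings, and the function name write_bytes are assumptions made for illustration and are not taken from the operating system.

```python
BLOCK_SIZE = 512  # hypothetical block size, chosen only for illustration


def write_bytes(disk, offset, data):
    """Write `data` at byte `offset` of a toy disk (a list of byte blocks)."""
    # Step a: allocate a system buffer to accept the user's data.
    system_buffer = bytearray(BLOCK_SIZE)
    # Step b: determine the location on the device.
    block_no, start = divmod(offset, BLOCK_SIZE)
    if block_no >= len(disk):
        # A full or unavailable device shows up here as a write error.
        raise OSError("write error: no such block on the device")
    if start + len(data) > BLOCK_SIZE:
        raise ValueError("this sketch handles writes within a single block")
    # Step c: read the physical block into the system buffer.
    system_buffer[:] = disk[block_no]
    # Step d: copy the user's data into the proper part of the buffer.
    system_buffer[start:start + len(data)] = data
    # Step e: write the system-buffer block back to the device.
    disk[block_no] = bytes(system_buffer)
```

A failure in any one of the steps is reported to the caller in the same way, which is consistent with the observation that the error message alone does not identify the failing step.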

The myriad of errors has complicated our search for the cause or
causes of the defective retrieve-common. The dissimilarity of these errors has
prevented us from relating them to one particular problem. Additionally, because
each of the errors has caused the system to terminate abnormally, the cause of that
termination could not be traced in real-time to a single function or process.

C.  SIX  THEORIES  ON  DEFECTS 

The inability of error messages to direct us to a definitive system defect has
led to the development of separate theories based on the information on
hand, which includes usage patterns, test results, debugging flags, and error
messages. Individually, these factors could not provide any assistance; however,
when they are combined, some portions of the problem may become visible. The
accumulation of debugging information allows us to develop six plausible theories
regarding the defective operation of the retrieve-common. Two of these theories
are related to the communication aspects of the MBDS system; three theories are
related to data manipulation by MBDS; the last one is related to the operating
system. These theories are presented below:

1. Defects in Communication

The retrieve-common requires processor communications in broadcast
mode. This mode of communications has resulted in many message-header
errors, which leads us to propose the possibility of two communication-related
errors:

a. There may be an MBDS design limitation on the size of the message being
broadcast. Therefore, the system fails if the size of the message grows
beyond the limit.

b. An operating-system-interface problem may exist. The retrieve-common
may require different sockets to be utilized during different activities,
thus causing the possibility of a socket-related error. The socket-related
error would produce a header error from the operating system.


2.  Defects  in  System  Processes 

Since  the  write  errors  point  to  possible  defective  interfaces,  the  problem 
area  may  be  narrowed  by  initially  reviewing  the  following: 


a.  PP  (i.e.,  the  postprocessing  process)  for  the  output  combined  records  of 
the  retrieve-common  in  the  controller  computer. 


b. The disk I/O process for base data (i.e., both the source and target files)
in  retrieve-common's  record-processing  process. 


c.  The  hashing  process  for  storing  the  source  file  of  the  retrieve-common 
in  virtual  memory  temporarily. 


3.  Defects  in  Operating  System  Supports 

As  discussed  earlier  in  the  chapter,  a  bus  error  is  related  to  the  number 
of active processes in the operating system. The possibility that the number of
active processes surpasses the limit designed into the operating system is small.


D.  THE  STRATEGY  FOR  EVALUATING  THE  THEORIES 


The capability of the system to operate correctly with very small databases
points to the possibility of a defect in MBDS processes. The three theories of
defects  in  system  processes  are  therefore  pursued  first.  The  broadcast 
communications  are  built  on  the  protocols  of  the  Ethernet.  They  are  the  next 
place  to  look  for  defects.  Thus,  the  two  theories  on  communications  defects  are 
considered  next. 

Operating system-related errors are the least plausible. The likelihood that
retrieve-common spawns an abnormal number of processes is very small.
Therefore, this theory is to be researched last. In this way, the theories with the
most promising defect detection and correction ideas are applied to the problem
first.

III. DETECTIONS AND CORRECTIONS OF DEFECTS


In  Chapter  II  we  have  developed  various  theories  for  the  possible  defect  in 
retrieve-common.  We  now  apply  these  theories  to  the  detection  and  correction 
of  defects  found  in  the  retrieve-common. 

A. A REDUCTION OF THE NUMBER OF PROCESSES TO BE
ANALYZED

All  of  the  error  indicators  resulting  from  our  testing  enable  us  to  conclude 
that  certain  parts  of  the  system  are  operating  correctly.  Therefore,  we  are  able 
to  reduce  the  number  of  processes  that  may  have  defective  operations. 
Specifically, with the exception of the communication and record-processing
processes (i.e., GET, PUT, and RP), we conclude that all of the other backend
processes  are  operating  correctly.  Since  the  directory-management  and 
concurrency-control  processes  (i.e.,  DM,  and  CC)  are  operating  correctly  during 
the  primary  operations  of  inserts,  deletes,  and  retrieves,  they  should  continue  to 
operate  correctly  in  supporting  the  retrieve-common. 

We  also  tested  the  controller  processes.  We  are  able  to  conclude  that  the 
insert-information-generator and the request-processing processes (IIG and TP)
are operating correctly. Specifically, in IIG the placement of clustered records
in the database stores is being conducted correctly; TP is operating correctly for
all other primary operations, where all requests are properly identified,
formatted, and transmitted.

Nevertheless, we must examine the five processes TP, RP, PP, GET and
PUT more thoroughly, since they support the retrieve-common operation. We
run the identical retrieve-common with two different database sizes, one of which
causes a system failure. This test indicates that the identical primary operation
formatted by TP operates correctly only on the smaller database size. Thus,
this test provides the necessary evidence that the formatted request provided by
TP may not be a factor in the system failure. Perhaps the system has failed due
to other factors contributed by other processes in handling the larger database
size.

Some other controller processes cannot be discounted as error-free. For
instance, there is some evidence from the error indicators that a possible defect
may exist in the communication processes, GET and PUT, which are to be discussed
in this chapter. As the larger size of the database affected the system
performance, the handling of the large amount of results by PP may be the cause
of errors too. Finally, the backend process RP, which accesses individual records
of a large database, has shown many error indications. We should examine it
thoroughly in the context of large database sizes.

Of the five processes we have mentioned above, four may cause the retrieve-common
to be defective. These four are PP, RP, GET and PUT; their testing
and evaluation in the context of large databases are presented in the later sections
of this chapter.

B.  THE  IDENTIFICATION  OF  DOCUMENTATION 
REQUIREMENTS  FOR  DEBUGGING 

In maintaining and debugging a complex system such as MBDS, the system
documentation is critical. Effective documentation assists in the efficient
determination of how a given process performs. Additionally, with the
documentation, modifications can be made to the process at appropriate places.
The documentation that is necessary to evaluate the MBDS processes can be
considered at three levels of detail:

a. Process Map - This documentation is developed for each of the system
processes (RP, CC, DM, etc.). It provides a high-level view of what
events are accomplished and when a particular process is activated. It
presents which procedures are called, what purposes are intended, and
where files of the source code are located.

b. Process Pseudo-Code - This documentation is also developed for each
of the system processes. It provides a short description of the tasks
completed by those procedures which have been highlighted in the
Process Map. The Process Pseudo-Code does not provide detailed
information on how procedures complete their tasks.

c. Transaction Flow - This document explains the events involved with
specific procedures, and a detailed transaction flow is developed. This
transaction flow represents the succession of events involved in a
particular subprocess or procedure. This documentation is presented in
flowcharts, which illustrate the logic of a specific procedure.

Appendices A, B, and C provide excerpts of the above three levels of
documentation. These excerpts should be used as a documentation guide for
system developers. The availability of three levels of documentation allows
system users and staff to select the level of documentation they require to
complete their task.

C.  ASSESSMENTS  OF  THEORIES  OF  DEFECTS 

With three levels of documentation, we now proceed to apply our theories of
defects  to  the  detection  and  correction  of  the  retrieve-common  operation. 

1.  Communication-Related  Theories  of  Defects 

In Chapter II, we have presented two communication-related theories of
defects. The first theory suggests that messages in transmission during
retrieve-common may be limited in size. The defective performance that occurs
at larger database sizes may be related to an inability of GET or PUT to handle
messages after these messages have surpassed a fixed message size. To validate
this theory, a review of the message structures involved in transmitting messages
in the retrieve-common is conducted.

The primary operation, retrieve-common, transmits and receives only
one message (BucketInfo) specific to the operation. This message delivers the
target records of a particular backend to the other backends. The BucketInfo
message is a formatted message that uses a fixed header. The header is
computed and formed during the insertion of records into the message buffer,
i.e., the message development. While reviewing the message development, we
note that the record addresses in the header are static and not modifiable. Each
backend transmits its BucketInfo message with the same header format. The
format of this message is presented in Appendix D.

Now,  we  apply  our  first  theory  of  communication-related  defects. 
Specifically,  the  theory  is  that  a  message  routing  error  is  caused  by  the  header 
error  of  the  message.  A  routing  error  could  only  occur  if  the  message 
transmitted  by  retrieve-common  uses  a  variable  format  for  its  addresses. 

Since  the  message  transmitted  by  the  retrieve-common  is  indeed  static  in 
its  header  format,  this  theory  is  not  possible.  The  message  header  for  any 
individual  message  is  transmitted  with  the  identical  header  format.  No  header 
adjustments  are  made  due  to  subsequent  changes  in  the  database  size,  since  the 
subsequent data are transmitted in subsequent messages. When a block of
records is required to be transmitted, the same header format for their
addresses is used. Thus, the message header is constructed in the same fashion.

The next theory is whether or not the BucketInfo message can
accommodate an excessive message size. The buffer for the BucketInfo
message is filled with records by using a standard looping mechanism which
contains a record counter, K. This record counter is used to keep track of the
number of records inserted into the buffer. Additionally, a byte counter, i, is
used to determine the length of all the records presented for transmission to
other backends. This byte counter is used in conjunction with K to determine
whether or not there is enough buffer space for the incoming records. If there
is not, the BucketInfo message is then transmitted to an exception procedure of
the operating system.

The  capability  of  retrieve-common  to  properly  fit  the  incoming  records 
into the message buffer, even though it has a fixed size of 1400 bytes, illustrates
that this implementation is database-size independent. We therefore discount the
theory  that  the  size  of  the  message  buffer  in  retrieve-common  is  implemented  in 
a  fashion  that  will  allow  the  system  to  fail  due  to  overloading  of  the  buffer  with  a 
large  number  of  records. 
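
The buffer-filling loop with its record counter and byte counter can be sketched as follows. This is an illustration under assumptions, not the MBDS routine: the function name pack_records is hypothetical, the message header is ignored, and a full buffer is simply closed and a new one started, on the reading that subsequent data travel in subsequent messages. Only the 1400-byte buffer size is taken from the text.

```python
MESSAGE_BUFFER_SIZE = 1400  # fixed buffer size stated in the text


def pack_records(records, capacity=MESSAGE_BUFFER_SIZE):
    """Split byte-string records into messages of at most `capacity` bytes.

    Returns a list of (record_count, payload) pairs, one per message.
    """
    messages = []
    buffer = bytearray()
    k = 0  # record counter for the message being built
    for record in records:
        if len(record) > capacity:
            raise ValueError("a single record does not fit in the buffer")
        # len(buffer) plays the role of the byte counter.
        if len(buffer) + len(record) > capacity:
            messages.append((k, bytes(buffer)))  # close the full message
            buffer = bytearray()
            k = 0
        buffer += record
        k += 1
    if k:
        messages.append((k, bytes(buffer)))      # close the last message
    return messages
```

For 45 records of 32 bytes each, this sketch produces one message of 43 records (1376 bytes) followed by a second message of 2 records, so the number of records affects only the number of messages and never the size of any one of them.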

The third communication-based theory suggests a defect exists in the
retrieve-common's utilization of the communication protocols supplied by the
operating system. A brief explanation of the communication protocol is
necessary. The operating system used by MBDS provides two different methods:
the reliable stream and the unreliable datagram. Stream communications are via
sockets, which are named locations in a process. When a process wants to send a
message to another process, it refers to the name of the socket in the other process
and transmits the message to the named socket. The operating system ensures the
communication is reliable and error-free. This type of communication is one-to-one
communication, i.e., from one computer to another computer.

Datagram  communications  allow  a  message  to  be  transmitted  from  one 
process  to  several  processes.  This  is  known  as  one-to-many  communications, 
i.e., broadcasting. However, the datagram communication is not reliable, i.e.,
occasionally one of the receiving processes does not get the messages. Thus, it is
unreliable  broadcasting. 

The method of communications in MBDS is reliable broadcasting based
on the use of reliable sockets and unreliable datagrams for interprocess
communication. A message is always broadcast first via the datagram
communication to all the other processes. If some receiving processes have not
acknowledged the receipt of the message, the message is retransmitted to them
via their sockets. A key aspect of this retransmission is that the socket names are
never changed, and new sockets are not established during the retrieve-common.
Thus, the broadcast mode of transmission in retrieve-common is reliable and
fail-safe. The discounting of the last communication-related theory allows us to
begin the evaluation of other theories.
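
The reliable-broadcast scheme described above can be sketched with two interchangeable transport functions. This is a simulation under assumptions, not the MBDS communication code: the names reliable_broadcast, datagram_send, and socket_send are hypothetical, and an acknowledgement is modeled as the return value of the datagram call rather than as a separate message.

```python
def reliable_broadcast(message, receivers, datagram_send, socket_send):
    """Broadcast by datagram, then resend by socket to silent receivers.

    `datagram_send(receiver, message)` returns True if the receiver
    acknowledged the message; `socket_send(receiver, message)` is assumed
    to be reliable. Returns the receivers that needed a retransmission.
    """
    # First pass: unreliable one-to-many transmission.
    acknowledged = [r for r in receivers if datagram_send(r, message)]
    # Second pass: reliable one-to-one retransmission over the existing,
    # already named sockets; no new sockets are created here.
    missing = [r for r in receivers if r not in acknowledged]
    for receiver in missing:
        socket_send(receiver, message)
    return missing
```

Because the retransmission reuses the same named destinations as the first pass, the size of the database changes how many messages are sent but not how any one of them is addressed.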

2.  Storage-Related  Theories  of  Defects 

To identify storage-related defects, we first review storage structures
used in the testing of the retrieve-common. The first storage structure tested is
the buffer structure in postprocessing. It may be implemented without the
capability to handle large amounts of data. Additionally, it may not provide a
unique buffer for the results of the retrieve-common. If either is indeed the
case, it may indicate why the retrieve-common cannot output large
amounts of data.


Our  analysis  has  determined  that  there  is  only  one  designated  output 
buffer  for  MBDS.  Retrieve-common  does  not  provide  its  own  output  buffer. 
We  then  direct  our  analysis  to  this  buffer.  The  buffer  is  implemented  as  an 
array  of  characters  with  a  maximum  size  of  1400  bytes.  The  procedure 
determines  the  amount  of  space  available  in  the  buffer  and  loads  the  empty  space 
with records waiting to be output. To empty the buffer, the procedure passes the
contents  of  the  buffer  via  a  message  directly  to  the  user-interface. 

We also find that MBDS utilizes the same procedure, storage structure,
and  buffer  to  provide  output  to  the  user  interface  for  all  the  other  primary 
operations.  This  review  invalidates  our  theory  that  either  the  storage  structure 
of  the  postprocessing  buffer  or  the  procedure  in  postprocessing  the  buffered 
records  is  defective. 

The conclusion that the output structure is implemented correctly has
led us to review the correctness of input structures. Input structures deal with
storage structures of data coming from secondary storage devices such as the
paging disk. Retrieve-common requires that every record of the source file
that satisfies the query be entered into the virtual memory. If the size of the
source file is large, more virtual memory would be required. As with any
secondary storage device, limitations do exist on the number of source records
the paging disk may support. Also, the paging disk is smaller than the base-data
disk of a backend. The possibility of a paging-disk overflow is considered here.
Additionally, this analysis allows us to review the implementation of the input
buffer. There may be a defect in the input buffer as well.

The new disk input and output (disk i/o) function is implemented to
overload the paging disk by reaching the user's limit on the base-data store,
known as the Quota, which is the disk storage allocated for the base data of a
particular user. The disk i/o function reads an entire track from the base-data
disk into the Track-buffer. The Track-buffer is implemented as a one-dimensional
array of 12,800 characters, which is also the size of a track. When the disk
read is completed, the contents of the Track-buffer are verified. To ensure that
records retrieved from the base-data disk do not exceed the capability of MBDS
to process them, all of the contents of the Track-buffer are processed before
another track of records is read. This processing consists of verifying records
against the query and hashing the appropriate records into the virtual memory
for later merging. In other words, this procedure ensures that the large
amounts of data on the base-data disk do not overrun the buffer space. More
importantly, the data can be processed one track at a time.
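
The track-at-a-time discipline described above can be sketched as follows. This
is only an illustration, not the MBDS code: the 12,800-character track size
comes from the text, while the names read_tracks, process_all, and handle_track,
and the use of an in-memory file object as the base-data disk, are assumptions.

```python
import io

TRACK_SIZE = 12_800  # characters per track, as stated in the text


def read_tracks(base_data, track_size=TRACK_SIZE):
    """Yield one track-sized chunk at a time from a file-like object."""
    while True:
        track_buffer = base_data.read(track_size)
        if not track_buffer:
            break
        yield track_buffer


def process_all(base_data, handle_track):
    """Process each track completely before the next one is read."""
    tracks = 0
    for track_buffer in read_tracks(base_data):
        handle_track(track_buffer)  # verification and hashing would go here
        tracks += 1
    return tracks


# Usage: a stand-in base-data disk holding two and a half tracks.
disk = io.StringIO("x" * (TRACK_SIZE * 2 + TRACK_SIZE // 2))
sizes = []
track_count = process_all(disk, lambda track: sizes.append(len(track)))
```

Because the generator hands over one buffer at a time, no more than one track
of unprocessed data is held in this sketch at any moment.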

The  ability  to  control  input  rates  from  the  database  stores  has  provided 
us  with  the  evidence  that  the  disk  i/o  process  is  not  the  cause  of  the  system's 
defect.  Therefore,  we  remove  the  disk-storage-related  theory  of  defects  from 
further  consideration. 

The final storage-related theory of defects to be reviewed is the theory
of the virtual-memory inputs/outputs. Even though the Track-buffer and the
disk i/o process ensure positive control of information input, they fail to
account for information retrieved from other sources. Each backend can
transmit a message of up to 9200 bytes. To process the message, the backend
must store it in the virtual memory, which may overload the paging disk.

The virtual-memory i/o process is used in the retrieve-common. Its
goal is to provide efficient temporary storage of records received from other
sources in the virtual memory. Our analysis is focused on the virtual-memory
i/o process.

a.  Hashing  and  Storage  of  Records 

The  retrieve-common  begins  with  TP,  i.e.,  the  Request  Processing 
process.  In  this  process,  the  type  of  query  is  identified,  formatted,  and 
transmitted  to  the  backends.  In  Appendix  C,  we  provide  a  review  of  the  specific 
subprocedures  involved  in  this  process.  The  following  high-level  summary  of 
procedures  is  provided  prior  to  our  determination  of  the  problem. 

The  retrieve-common  differs  from  the  other  primary  operations  after 
the  disk  i/o  process  is  completed.  The  following  steps  of  the  retrieve-common 
operation  also  indicate  where  the  difference  occurs: 


Step  1.  Allocate  space  in  the  virtual  memory  to  store  information  about  the 
primary  operation. 

Step  2.  The  directory  management  process  provides  a  list  of  addresses  of 
tracks  that  contain  records  likely  to  satisfy  the  query.  Each  of  these 
tracks  is  fetched  from  the  base-data  disk  and  placed  into  the  virtual 
memory,  i.e.,  the  Track-buffer. 

Step 3. The records in the Track-buffer are examined one record at a time. If
a record is marked for deletion or does not satisfy the query, it
is discarded. If the record does satisfy the query, the appropriate
attribute values are extracted and the record is placed in a result
buffer.

Step 4. This is where retrieve-common differs from all the other primary
operations. When the result buffer is full, the extracted attribute
values of the records in the buffer are sent to a function, HashFunc,
which provides the virtual-memory addresses and temporary storage
of these records. This function is unique to this primary operation.

Step 5. Steps 2, 3 and 4 are repeated until all of the addresses provided by
the directory management process are processed, the tracks at these
addresses accessed, and the records satisfying the query hashed into
the virtual memory.
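
Steps 2 through 5 can be pictured with the following minimal sketch. It is not
the MBDS implementation: the record layout (dictionaries with an optional
"deleted" flag), the function and parameter names, and the final flush of a
partly filled result buffer are all assumptions made for illustration.

```python
def process_source_query(tracks, satisfies, common_attr, hash_batch,
                         result_buffer_size=4):
    """Scan tracks, filter records, and hash each full result buffer."""
    result_buffer = []
    for track in tracks:                      # Step 2: one track at a time
        for record in track:                  # Step 3: one record at a time
            if record.get("deleted") or not satisfies(record):
                continue                      # discard the record
            result_buffer.append((record[common_attr], record))
            if len(result_buffer) == result_buffer_size:
                hash_batch(result_buffer)     # Step 4: hand off to hashing
                result_buffer = []
    if result_buffer:                         # assumed: flush the remainder
        hash_batch(result_buffer)


# Usage: two small "tracks" of customer records, result buffer of two.
tracks = [
    [{"code": "C102", "qty": 5}, {"code": "C103", "qty": 7, "deleted": True}],
    [{"code": "C104", "qty": 9}, {"code": "C105", "qty": 1}],
]
hashed_keys = []
process_source_query(
    tracks,
    satisfies=lambda record: record["qty"] > 0,
    common_attr="code",
    hash_batch=lambda batch: hashed_keys.append([key for key, _ in batch]),
    result_buffer_size=2,
)
```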

It is important to note that these five steps are designed for the source
query. They are not duplicated for the target query, since records satisfying
the target query, although hashed, are not stored temporarily in the virtual
memory as the source records are in Step 4. Our analysis of the hashing
function begins with Step 4. Hashing records into the virtual memory
requires the process to extract the common attribute value of a record from the
result buffer, to develop a virtual-memory address confined within the hashed
address space, and to place the attribute value and record address in the
hashing table. In addition, the process resolves any collisions. Collision
resolution is based on a chaining method in which colliding records, i.e.,
records whose different attribute values are hashed into the same
virtual-memory address, are linked together.
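
As a rough illustration of chaining, the sketch below stores (attribute value,
record address) pairs in per-bucket chains. The class and method names are
hypothetical, Python lists stand in for the linked blocks, and the
first-character hash is chosen only to force a collision.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining entries per bucket."""

    def __init__(self, bucket_count, hash_func):
        self.buckets = [[] for _ in range(bucket_count)]
        self.hash_func = hash_func

    def _index(self, attr_value):
        return self.hash_func(attr_value) % len(self.buckets)

    def insert(self, attr_value, record_address):
        chain = self.buckets[self._index(attr_value)]
        chain.append((attr_value, record_address))

    def lookup(self, attr_value):
        chain = self.buckets[self._index(attr_value)]
        return [addr for value, addr in chain if value == attr_value]

    def chain_length(self, attr_value):
        return len(self.buckets[self._index(attr_value)])


# A deliberately weak hash (first character only) makes the two keys collide.
table = ChainedHashTable(bucket_count=8, hash_func=lambda value: ord(value[0]))
table.insert("C102", 10)
table.insert("C103", 20)
```

Both keys land in the same chain, so a lookup has to walk that chain and
compare attribute values to find the right record address.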

In Appendix D, we provide a transaction flow of the steps involved in
determining the virtual-memory addresses of the records of a transaction. Here
we address only those steps where there are defects.

The  original  hashing  algorithm  is  presented  below: 

Step 1: Extract the common attribute value (attr-value) from a record in the
result buffer.

Step 2: If the syntactic type of attr-value is the string type, then place the
first two characters of attr-value in the temporary variables c1 and
c2. Otherwise, designate attr-value as a number and assign it to a
temp variable.

Step 3: Calculate the bucket number. If attr-value is a string and the second
character is <= 48 and = 0, the bucket number is (c1 - 65) * 36. If
c2 > 48, the bucket number is ((c1 - 65) * 36) + (c2 - 48). If c2 is
greater than but not equal to 48, then the bucket number is calculated as
((c1 - 65) * 36) + (c2 - 97) + 10.

Step 4: If attr-value is a small integer, 2, the bucket number is attr-value
- 0.

Step 5: If attr-value is a large integer, 3, the bucket number is (attr-value
- 0)/. 61

Step 6: This bucket number and record are input into a temporary buffer,
and the common attribute of the next record is processed in Step 2.

The above algorithm fails to fulfill the two premises of hashing:
randomness and uniformity [Ref. 8]. A good hashing function transforms a set
of keys, i.e., common attribute values, into a set of random locations uniformly
distributed over the range of the hash table [Ref. 9].

The present hashing algorithm fails to randomly disperse records when
the common attribute values are of the string type and share the same first two
characters. For example, given the two customer codes C102 and C103 as
common attribute values, the algorithm computes them as follows:

For C102, (67 - 65) * 36 = 72 (bucket number).

For C103, (67 - 65) * 36 = 72 (bucket number).

Each of them furnishes the same bucket number, i.e., virtual address, at which
to place its respective record.

Although this example shows only the lack of randomness, the other
deficiency, lack of uniformity, is illustrated by the way the algorithm handles
numeric values with a calculation that is different from the one used on string
values. For example, given the two customer codes 835 and 916 as common
attribute values, the algorithm computes them as follows:

For 835, 835 - 0 = 835 (bucket number).

For 916, 916 - 0 = 916 (bucket number).

Therefore, the determination of virtual addresses for records is based on two
separate calculations.
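
The two examples can be replayed with a small sketch of the original
calculation. Several points here are assumptions rather than the MBDS code: the
function name is hypothetical, the second-character ranges (digits 48 to 57,
lowercase letters 97 to 122) are one reading of Step 3, and the large-integer
rule is left out because its divisor is not legible in the source. Under this
reading the exact bucket number of a string key depends on how the
second-character rule is applied, but two keys that share their first two
characters always share a bucket.

```python
def original_bucket(attr_value):
    """Bucket number under one reading of the original algorithm."""
    if isinstance(attr_value, int):
        return attr_value - 0          # small-integer rule (Step 4)
    c1, c2 = ord(attr_value[0]), ord(attr_value[1])
    base = (c1 - 65) * 36              # first character, with 'A' = 65
    if 48 <= c2 <= 57:                 # assumed: second character is a digit
        return base + (c2 - 48)
    if 97 <= c2 <= 122:                # assumed: second character is lowercase
        return base + (c2 - 97) + 10
    return base


same_bucket = original_bucket("C102") == original_bucket("C103")
numeric_buckets = (original_bucket(835), original_bucket(916))
```

Only the first two characters of a string key ever enter the calculation, which
is why long runs of similar customer codes pile into one chain.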

The collision resolution technique is reviewed next. The hashing function
ensures that each of the 8192 buckets in the hash table serves as the head of a
linked list of blocks. When a block of the bucket has reached its limit of 1000
bytes, an operating-system call, alloc, is made for more memory in order to
construct a new block. The new block is then filled with the waiting record. If
the original block has not reached its capacity, the new record is inserted
into it.

This type of collision handling is effective if it is used in conjunction
with a hashing function that ensures uniformity and randomness [Ref. 8]. Ideal
uniformity means that each linked list of blocks holds the same number of
collided records. Additionally, effective randomness keeps the number of
collided records in each linked list small. If a uniform distribution of
records does occur, the hash table and the bucket size allow approximately
245,000 32-byte records to be stored before any collision takes place.
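
The capacity figure can be checked with a little arithmetic. The bucket count,
block size, and record size below come from the text; the value of 30 records
per block is an assumption (room left for per-block overhead such as a link to
the next block), chosen only because it lands close to the quoted figure.

```python
BUCKETS = 8192        # buckets in the hash table
BLOCK_BYTES = 1000    # limit of one block
RECORD_BYTES = 32     # size of one stored record

# Upper bound: only whole records fit in a block (1000 // 32 = 31).
upper_bound = BUCKETS * (BLOCK_BYTES // RECORD_BYTES)

# Assumed: overhead in each block leaves room for 30 records.
with_overhead = BUCKETS * 30
```

Under these assumptions the first block of every bucket holds between roughly
245,000 and 254,000 records in total before any bucket needs a second block.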

However, uniform distribution does not occur in most instances. The
hashing function allows the worst possible distribution to occur, i.e., the
hashing of every common attribute value to the same bucket. In that case,
insertion and searching perform no better than a linear search, which defeats
the purpose of hashing.

b. Defects in Hashing

With the evidence that the hashing algorithm is defective, we then
determine its impact on the MBDS system. We find that the separate
chaining technique used in collision handling correlates with the
message-header and buffer-error indicators received in our test runs. Also, we
find that the time allocation is important to the well-being of the
retrieve-common.

Collision handling with the separate chaining technique is noted for
its ability to let a linked list grow as long as needed. However, this growth
is limited by memory availability. The capability of the present file system to
provide the memory necessary to sustain the growth of the linked lists is
questionable. The file system allows for the segmentation of memory into
variable sizes [Ref. 7]. Additionally, the amount of memory allocated to a
particular retrieve-common cannot be dynamically increased. Therefore, a very
large set of records from both the source and target files can run out of
memory.

The memory size for the buckets of a retrieve-common is too small.
During an operational test that requires large amounts of data to be hashed
into the virtual memory, a write error is observed. This error is a direct
result of the fact that the retrieve-common has used up its allotted partition
[Ref. 7]. Using software monitors, we dynamically observed the dedication of
available memory to processes performing tasks for the retrieve-common. A
utilization level of approximately 99 percent was observed moments before the
MBDS system shut down.

With the evidence that the defective hashing algorithm is the cause of
the shutdown, we work to correct the defect. The revised hashing algorithm is
designed to provide the randomness and uniformity that are lacking in the
original algorithm.

D. A NEW HASHING ALGORITHM

We  first  ensure  that  the  new  algorithm  is  applicable  for  all  possible  key 
types,  i.e.,  all  possible  value  types  of  the  common  attribute. 

The technique consists of transforming every character of the common
attribute value to its internal representation, i.e., an ASCII integer
[Ref. 10]. The sum of these integers over all the characters of the common
attribute value (called x) is then presented to the hashing function. An
example of this new technique is illustrated below:

For  C102,  we  have  C  =  67,  1  =  49,  0  =  48,  and  2  =  50. 

Thus,  x  =  67  +  49  +  48  +  50  =  214. 

The randomness of our hashing function is provided by the division method
[Ref. 7]. This method is defined as H(x) = x mod m + 1, where m is
preferably a prime and x is as defined above. This computation
provides the remainder of the division of x by m. The remainder plus
one is the virtual-memory address.

The division method is used because it ensures an address within the size, m,
of the hashing table. Additionally, if the table size is a large prime number,
the division method makes collisions of common attribute values uncommon
[Ref. 8]. For example, given x with a value of 214 and a hashing
table whose size, m, is 8191 buckets, the following address calculation occurs:

H(x) = x mod m + 1
H(214) = 214 mod 8191 + 1
       = 215.

The new hashing algorithm is presented below:


Step 1. Extract the common attribute value (attr-value) from the record in
the result buffer.

Step 2. Transform each character of attr-value to its internal ASCII
representation.

Step 3. Calculate the sum (temp) of these ASCII values.

Step 4. Conduct the modulo division on temp. The resulting remainder plus
one is the hashing-table entry.

Step 5. The record is directed to the virtual-memory storage via the
appropriate hashing-table entry.
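
Steps 2 through 4 amount to only a few lines of code. The sketch below is an
illustration rather than the MBDS source: the function name new_hash is
hypothetical, and converting numeric keys to their character form before
summing is an assumption about how "all possible key types" are handled.

```python
TABLE_SIZE = 8191  # m, the prime table size used in the worked example


def new_hash(attr_value, m=TABLE_SIZE):
    """Return the hashing-table entry H(x) = x mod m + 1."""
    x = sum(ord(ch) for ch in str(attr_value))  # Steps 2-3: sum of ASCII codes
    return x % m + 1                            # Step 4: remainder plus one


entry_c102 = new_hash("C102")
entry_c103 = new_hash("C103")
entry_835 = new_hash(835)
```

The two customer codes that collided under the original algorithm now receive
different entries, and string and numeric keys go through the same calculation.
One property of a plain character sum is worth keeping in mind: keys built from
the same characters in a different order, such as C102 and C120, still produce
the same x and therefore the same entry.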

Operational testing of the new hashing algorithm indicates that the
hashing errors of the original algorithm have disappeared. In addition, the new
hashing function provides variable buckets, which are absent in the original
function.

E. AN UNFORESEEN COMMUNICATION-RELATED DEFECT

An  unforeseen  error  is  discovered  while  conducting  testing  on  the  retrieve- 
common  with  large  databases.  This  error  is  directly  related  to  the  operations  of 
MBDS  backends. 

We recall that retrieve-common requires each backend to transmit its
target records to the other backends. A message transmission error occurs
during this transmission. We observe that no error occurs if the message
contains all of the records (i.e., it is not segmented). Additionally, if the
portion of the message sent is the first segment of several message segments,
that segment is error-free. An error occurs if the message meets neither of
these two conditions.

The message error occurs only when the first 27 characters of the message
body are incorrect. The attribute that is necessary to determine the virtual
address of the record is therefore incorrect. As a result, the hashing
function attempts to compute a virtual address using an incorrect value.
Incidentally, the value that the hashing function uses is always the content of
a register used in an earlier operation. Using 16 characters to compute the
virtual address leads to an address too large for the operating system to
handle. This excessively large address causes a core dump and an immediate
system shutdown.

Our analysis shows that message timing is the cause of the message error.
This conclusion is based on an exhaustive analysis of sample bucket-message
traffic during different phases of transmission. The bucket message is reviewed
(1) before and after transmission between processes in the same backend, (2)
prior to being inserted into the operating system for interprocess communication
among backends via the interprocess communication (ip) buffer, and (3) after
receipt by the backends. The bucket message is correct in all three locations
except when it is placed in the ip buffer of the operating system. The ip buffer
is an intermediate buffer of the operating system for message transmission
[Ref. 11]. Although the message goes into the ip buffer correctly, it exits
incorrectly.

The ip buffer has a size of 1000 bytes [Ref. 11]. However, the
messages to be inserted into this buffer can be up to 1425 characters long.
Since a message can be larger than the buffer, we discover that a flushing
mechanism is used. It ensures that when the buffer reaches its limit, it first
outputs its contents to the appropriate destination and then allows the receipt
of additional messages. Our tests indicate that this mechanism is not given
enough time to complete the flushing task. When the number of target records to
be transmitted requires multiple bucket messages, the messages are damaged in
the ip buffer.
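
The timing problem can be pictured with a toy model. Everything in it other
than the 1000-byte buffer size and the 1425-character message size is an
assumption: time is counted in abstract ticks, a flush is taken to last three
ticks, a message is cut into buffer-sized segments, and a segment that arrives
while the previous flush is still running is counted as damaged. It does not
model the actual Unix interprocess communication mechanism.

```python
BUFFER_BYTES = 1000  # ip buffer size given in the text


def split_segments(message, size=BUFFER_BYTES):
    """Cut a message into pieces that each fit in the buffer."""
    return [message[i:i + size] for i in range(0, len(message), size)]


def transmit(message, gap_ticks, flush_ticks=3):
    """Return (delivered, damaged) segment counts for a given send spacing."""
    busy_until = 0          # tick at which the current flush completes
    delivered = damaged = 0
    now = 0
    for _segment in split_segments(message):
        if now < busy_until:
            damaged += 1    # arrived while the previous flush was running
        else:
            delivered += 1
        busy_until = now + flush_ticks
        now += gap_ticks
    return delivered, damaged


bucket_message = "r" * 1425                     # largest size cited in the text
fast = transmit(bucket_message, gap_ticks=1)    # sender outpaces the flush
paced = transmit(bucket_message, gap_ticks=3)   # sender waits out each flush
```

In this model the first segment always gets through and only later segments are
at risk, which matches the observation that unsegmented messages and first
segments are error-free.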

The size limitation of the ip buffer and its slow performance when
transmitting multiple target records point to a message-timing error. Messages
enter the ip buffer faster than the ip buffer can empty its contents by sending
them out as a message. This difference in speeds causes the messages in the
buffer to be affected by incoming records. One expedient way to overcome this
limitation is to allow enough time for the flushing mechanism to complete each
flushing task.

IV. A SUMMARY OF FINDINGS


A.  DEFECTS  DISCOVERED 

The retrieve-common operation has not been performing correctly due to a
communication-related timing defect and a defective hashing function.

1.  Causes  of  the  Communication-Related  Defects 

The communication-related defects have been caused by a buffer-timing
error. The operating system's communication buffer is unable to completely
flush its contents before the arrival of the next message. Therefore, in some
instances, the contents of the communication buffer can be inadvertently
modified, which provides the necessary conditions for the ioctrl error.

2.  The  Defects  of  The  Hashing  Function 

The hashing function is considered defective because it fails to provide
randomness and uniformity. In the case of randomness, when the first two
letters of the common-attribute value are the same, the hashing function
generates the same virtual address. The lack of uniformity is evident in that
different address calculations are used for string and numerical attribute
values.

The defect in the hashing algorithm becomes apparent with large
databases in which many records are assigned to the same virtual address. The
hashing function then exhausts the user's memory allotment, which leads to the
write error.

3.  Other  Findings  Concerning  Defects 

The cause of the bus error that we observed during our theorizing stage
is now known. Since MBDS is a loosely coupled system, the backends' operating
systems work independently. When an abnormal termination occurs on one
backend, it does not automatically cause the termination of the other backends.
Processes that are interacting with the backend that terminated may shut down,
but the others will not. These remaining processes require manual
termination. This need for manual termination can result in duplicate
processes if MBDS is reactivated.

MBDS does not allow duplicate processes. Therefore, the operating
system presents a bus error when the MBDS system is reactivated and duplicate
processes exist. This deficiency is corrected by developing a program that
shuts down all processes on the backends prior to MBDS reactivation.

B.  BENEFITS  OF  THIS  RESEARCH  PROJECT 

The benefits of this research are substantial. They are presented below:

a. We have determined that the MBDS process architecture is effective. The
location of the merging functions takes advantage of the peculiarities of
the system network and minimizes delays.

b. We have developed and presented a documentation structure that will
assist system designers and maintenance staff in designing and servicing
complicated software. Examples of such documentation are presented in
the appendices.

c. We have presented a methodology for efficient trouble-shooting of
complex parallel-software systems. With the increasing development of
parallel systems, this methodology provides an effective guide for the
system staff who conduct system maintenance.

d. We have determined the causes of the defective performance of the
Primary Operation, Retrieve Common. We are able to correct one of
the defects, the problematic hashing algorithm. However, the
communication-timing defect will require further analysis. The timing
analysis necessary to flush the ip buffer is beyond the scope of this study;
besides, it is a problem inherent in the operating system, not the MBDS
system.

e. Finally, we have corrected the file-path errors which adversely affect the
ability to develop test databases.

The end result of this research is that the Primary Operation, Retrieve
Common, can now manipulate and merge a database 500% larger than at the
outset of this research. More importantly, we have provided an outline for the
successful trouble-shooting of complex parallel systems.

C. FUTURE WORK

The next step in the development of the MBDS system is to correct the
communication-related timing defect, as indicated in item d of the previous
section. This may require some modifications of the operating system, i.e.,
Berkeley 4.3 Unix.




F  F.t’CESS IHG  IIAE 


<=• i  s'  w  .  '  ti  i  n  *  ht  Ft,  F  p  r 
f  ,!!>  -r  1  nr  V.  .  : . . :  *  he  ir-. 

:  ••  t  h>--  ir  de 

i  .  !  k  :  <•  f  •  ,  -  ;  .  .  •  ■ 

a  Li  .1 

-•cess  . 

'■p  C 
- —  -  / 

f  i  no  :i 

ruff 

'  ;  E 

Th-  i 

the  it  L-a 

Tr.tr  ti 
an-l  «• j 

reset. taf  •_  .  of  fun  :*  i  ns  which 

time n Tat  i  n  [  l  :  tides  t n f rnia f.  i  -  a,  oil 
sic  capabilities,  and  the  fil^ 
r  un;  ent.ati  c  r.  will  tr  c  v  i  :i  o  a 

e  l  ien  /ehi  users. 

ftv;  t:  :: 

5  F  *  :  ■  i 

e  'Tr,r  :-cr 

If.  !.  n 

re  :  : 

F  •  •  ■  ■  P  ini' 

i.  i :  i  *;  s  r 

IP  ~i  !  ’  ' 

s  I'i'  i  i.  *• 

i ni * i a  lire 

inxtiala  «,»  communication  channels 

•l.k  W!i 'm:-  it-  • 

•*-  O  *  *  <“■  »- 

t!£'*'  the  ne::t  message 

i:  re  q u e c *:  w a  i  1 1 n  z  f  c i  r  e q i  : . 

!  !  '  'i 

IP"!  : 

t  ...  real®?*  id  in  message  buffer 

S  li  li  i.  J  V 

receive  a  me  s  s  a jo 

•  ;•  : 

wai*.  :.r:i 

v:  i *  f  - ;  message  <  i  T  '  'Cii  le4  i  -  n 

Sen  lei  JH  F 

reep  ci  ■ 

jet  the  sender 

1  FT  \'i: 

•  V i  el  !  I  F 

re  :|  :  :  :  message  fres  r  tf 
re.-f  ci  get  the  message  type 

1  po,  |f  r  ressim 

Ff  in  ! .  i  i  r  4  F  i  V 

reef  re. 

re  ;:i  sr 

p recess  a  request 
return  request  in  buffer 

•ie*  km;  ,i  |  ti 

dl  tnipm>:- 

•J  get  ptr  tc  recoil  template 

RPc  ;f,t 

rbabs 

allocate  a  result  buffer 

Aid  FT  P  F  Li 

1  'heck  f  t  Py 

alls’  • 

alls*:  e 

allocate  storage  for  request 
allocate  hash  info  structure 

rT.i.jr  '■  r 

allst: 

finds  any  agg  op  in  request  tails 

F  i  1 J  F  ?  F 

reepsi 

get  the  request  id 

ITT  I nre: t 

TP  FFT'H 

■i e ‘  free  die  tea 

5 1  i  n  s 

H  i  5  v  5 

disks 

case  INSERT 

fetch  a  t  ra-k  buffer  f ~r  insertion 
qe*  a  region 

i  inf-  1 :  re  : 

i i  sks 

put  information  in  *•  he  i»3i  -  r, 

maf  b  r°:j 

f  i  r.  :i  :!  i  r  re  j 

.  -j  i  c:  i.-  c? 

disks 

mar  *■  f  h°  t e  i i  n 

get  inde::  of  die  entry 

pi  a  i  T  F 

uni •  i i  s 

V  5  5  <=>  t  t  h  °  T  F  ptr 

•  :  ; :  ? 

re cp  cr 

send  I,  0  message  tc  Eli 

5p:’  ; 


Qet  free  lii 

?  r eg 

disk? 

get a region

f  ■  ut  i  n  f  '  •:!  i 

re.; 

disk  s 

put information in the region

-  I; 

ma;  _dic_reg 

disks 

map to the region

■'  as  a  1  : 

FIN .7  FF 

~  t  *  t  rz 

ins;- 

insert a record

$  1 1  IN5EP 

T  RE  TORI 

ins;: 

insert the record into the track buffer

TF  CT  p r 

disks 

store track buffer back to the disk

find  di 

i  eg 

disks 

get index of dio entry

ma^  die 

rea 

disks 

map to the region

•  >  a  : 

ah  ‘  ”e 

DioCFf 

c 

reepsi 

send I/O message to DIO

CP  ^  •' 1 


,7T  F  "  De  ] 

st:  ret.  -je 

1  case  DELETE 

TF  FETCH 

d  i  s  k  s 

fetch  a  track  buffer  for  insertion 

as  above 

RPCCENI  MFLETIvK 

rb  a!  r 

send  completion  signal  tv  controller,  77 

UA3H  Fn?J  ■ 

ret  ■’!: 

Pi  -ai-’a?^  Target  I 

SVI:  .1 

nfc  re-.  '  m 

HASH  THE  F.F'7r,F  D 

I  ef  -in 

F  lit  HashPuf  fer 

ret  cvr.: 

Bucket Block 

<0  h  /-•/“*»»> 

StoreRe cord 

ret  com 

A1 1 or Block 

ret  corn  • 

allocate  a  block 

Broadcast  Target 

Info  ref 

send 

HEP  :,e 

ret  com 

PE7_7NTL$FT  S 

reepsr 

send  the  results  to  the  controller 

send 

FES  CMTLSF.F  S 

reaps r 

send the results to the controller

sen-:! 


r 

M  F  i  n  F  «=•  q  $  P.  F  7 

reep  n r 

send the request id (update) to DM

r* 

7  FinReqSF.F  S 

reepsr 

send the request id (non-update) to CC

putF.i  .1 

reepsi 

put request id in message buffer

ST  U 

[■  d  a  *  e 

stupd 

case UPDATE

TB 

FETCH 

disks 

fetch a track buffer for insertion

: •  as  above 

Feg 

r  tl ;  {■!  .  i  e Ge n  I  n  s  5  F E  .7 

r e :ps r 

send 

message  to  F.EC'F 

f*  T  1 '  \ 


P.F  ContinueGenlns 

rpcont 

INSERTs caused by an UPDATE can continue

FBSSEND  COMF  LET  I'd! 

-  '  as  above 

r  h  a  b  p 

send completion signal to controller, CC

1  7 1 i  a  n  ■  i  e  1  7 1  u  s  F  e  s 

•  hanue  ! 

a  l  e  ■■:  i  i  lias  chan ue  1 

luster 

F.hr*':An3$  FF  F 

r  eaj  ?  r 

receive  ['M's  answer  o 

n  '.-hi s t 

°  t  chariM^1 

PFurr 

up  d[ 

map  d i '  ■  r  eg 

-  •  as  above 

d  i  sk  s 

ma;  to  the  reui<  n 

PH77SFF  S 

reepsi: 

send  the  new  record  t 

o  F.E  jt 

.9  e  i  r.  i 


Tr  :’T:rr 

•  *  a  2  ol  =  ..-ve 

disk  ; 

store  track  buffer  back  to  the  disk 

i  T J .  fl 

ri'.isr.  r 

n  o  me  r  e 

re  op si 

no more generated inserts for an UPDATE
get the request id

RE  c  SEIJI  COME  LETIvN  rfc-al.  s  send  completion  signal  to  controller,  DM 

•  as  at  ;  ve 

IFF  f;F  tecp:  .  "message"  from  self 

Type$PF  P  recpsr  get  the  message  type 

1  FeqF  t  ■  .  ess  i  i.  ; 

•  -  as  above 

#**######*##****»###* 

IFF  <'ijt I,  AM'.THFF  BE 
TypeS F F_F 

! •  re  i  j  • 

reel sr 

message  from  TI 
aet  the  message  type 

1  '  rum  n  mess  ages 

?  e  -  orhn « 2  j  .  rr»  a  p 

|  F  i  -1 ;  F  F  Ft 

r  0  c  t  •  2 1 

retrieve  common  -  alloca*:0  space 

All  C‘  •  PF  li  Pet 

S',  m  all st  . 

allocate  structure  space 

1  F-lS  1  qSFF  F 

1'P1'"  {'2  1 

««=•■  pt  r  to  next  msi  in  queue 

rpOCESS  BE  Target 
St  treFe  tor  d 

AIL  ••  1  -k 

ref  -jem 

ref  cm 

t  et  cor,. 

all  cate  a  block 

MERGE 

FES  CMTL$PF  S 
seii'J 

r  e  t  c  cm 

recpsr 

send the results to the controller

1 DicStopSFF  S 

reaper 

send  a  stop  message  to  DIO 

seri  ! 


#),####*»########*##*###*#**####*#*############################################# 


I  F’F  L’  X '  ■  recproc 

T  y  p  ‘  ?  F  r  F  tecpsi 


F  Wi  i  t  e  ■ orrq  let  e  1 

r  e  •?  p  r  o 

FrIA  ill  S  F  F  F 

re-ps r 

I W  Insert 

wereqs 

F F  s  SEMI 1  Ct'I  FP  LET I ON 

rbabs 

•  ■  as  abov<=. 

o  FmFeqiFF  S 

rec|  s r 

put Fid 

reap s r 

F«>  T  fr°“ 

rpf 

sc*"  f  r  c ^  die  r e  n 

disks 

find  die  reg 

dis  k  s 

I  K  jafp 

werea? 

message from DIO
get the message type


physical  write  is  completed 
get  request  id  of  completed  read 

if  INSERT 

send  completion  signal  to  controller,  CC 

send  the  request  id  (non-update)  to  CC 
put  request  id  in  message  buffer 

f  re“  t  h>=  spa-e  used  b  y  a  request 

find  entry  for  a  request 
get index of dio entry

if  DELETE 


PRICE NT  C'.'Mr  LE7I-  -N  rbabs  send  completion  signal  to  controller,  CC 


v  '  a?  a r  ■  ‘  -*e 

IVv,  Update 

wcreqs 

if  UPDATE 

F.e  N.  !  !:  re  lenl 

F.F  S 

Le  cp  sr 

send  message  to  F.EQF 

FI  Continue  jer.lR? 

rpccnt 

INSERTS  caused  by  an  UFDATE  can  continue 

PPC-CEND  •"’OMF  LETT  "'H 

rbal  s 

send  completion  signal  to  controller,  CC 

•  as  a I.  ” *e 

cob  flf*?*0  "i  i  1  17  o  <7 

disks 

find  entry  for  a  request 

find  di o  reg 

disks 

get  inde::  of  dio  entry 

|  PesL’at  a  SKI  P 

recpsr 

restore  data  received  from  Die 

f  i  n  1  d  i  *  r  e  g 

disks 

get  inde::  of  dio  entry 

get  TP;*- 

unixdi sk 

:s  aet  ptr  tc  track  buffer 

1  !•  I  r»:- a  blor.-.j  l*--t  ed 

re  opr  oo 

physical read is completed

F.i  ,1A  l-.ii  1  f.F  F 

1-“=-  P  SI 

get request id of completed read

I  Ft;  Insert 

rep  re  c. 

if  INSERT 

!!H[  i  I  I  *r*  _I 

disks 

map  t-  the  region 

■  ■  as  ai 

.Vino  E’F'  • 

ins}. 

insert  a  record 

• .  •  as  at-:  ve 

1 P  '  Fet 

r  c  r  e  «• 

if  RETFEIVE  [ -COI-CiCNi 

TP  F  F  T 1  ~  H 

d  i  s  k  s  • 

fetch  a  track  buffer  for  insertion 

■  •  as  abc-.-e 

$  F'F.TF  PROCESSING 

retp 

process  RETREIVE 

map  -.iio  r  eg 

disks 

maj  to  the  region 

•  •  as  at-f-a 

chk  woe  ft 

chk.qry 

check whether record satisfies QUERY

RF  aggregate 

ret  p 

calculate  any  aggregate  operations 

XT  FA.  :t 

retp 

get  attribute  and  value  for  target  list 

BT  HASH  FUN.' 

ret  by 

BY  HASH  PEC'. 

-F.r 

ret  by 

hash  and  store  the  records 

St  or  0 B y Ro o’ 

■ord 

ret  by 

A1 locByBlock 

retby 

add  a  new  bucket  to  the  end  of  the  list 

<> 

PRC  AT  FUT  SENT 

r  babs 

put  aggregate  results  into  result  buffer 

P8:FriT  SEN!- 

rbabs 

put  request  results  into  result  buffer 

O  as  abcv 

e 

f ill_res_Lu£f 

retf 

fill  result  buffer 

XTFAoT 

retp- 

get attribute and value for target list

BY_HASH_FUNC

retby

<> as above

RB$PUT_SEND

rbabs

put request results into result buffer

HASH_FUNC

retcom

<>  as  above 

P.ESjCNTLSF.F  s 

recpsr

send the results to the controller

send

Send_Hash_Info

retby

RB$PUT_SEND

rbabs

put request results into result buffer

<> as above

RB$AG_PUT_SEND

rbabs

put aggregate results into result buffer

RB$PUT_SEND

rbabs

put request results into result buffer

set_free_dio_reg

disks

find entry for a request

find_dio_reg

disks

get index of dio entry

RB$SEND_COMPLETION

rbabs

send completion signal to controller, CC

<> as above

f :  h  ■: -r>k*ut. 

ret  by 

free  the  space  used  by  a  block 

f  .  c  ^  1 

P.e  •’  trof 

rp  free 

free  the  space  used  by  a  request 

1  F  '  Del  et  >- 

rcreqs

if DELETE

TB_FETCH

disks

fetch a track buffer for insertion

<> as above

$DEL_PROCESSING

delp

process DELETE

map_dio_reg

disks

map to the region

<> as above

CHK_QUERY

chkqry

check whether record satisfies QUERY

TB_STORE

disks

store track buffer back to the disk

<> as above

RB$SEND_COMPLETION

rbabs

send completion signal to controller, CC

set_free_dio_reg

disks

find entry for a request

find_dio_reg

disks

get index of dio entry

IF”  U*  • 

rcreqs

if UPDATE

TB_FETCH

disks

fetch a track buffer for insertion

<> as above

$UPD_PROCESSING

updp

process UPDATE

map_dio_reg

disks

map to the region

<> as above

CHK_QUERY

chkqry

check whether record satisfies QUERY

IM  _UP  FT 

updp

increment records being updated

$UPD_RECORD

updp

UPDATE the record

OIJVSF.F  g 

recpsr

ask DM whether record changes cluster

send

Reqp_NoMoreGenIns$RP_S

recpsr

send message to REQP

send

RP_ContinueGenIns

rpcont

INSERTS caused by an UPDATE can continue

RB$SEND_COMPLETION

rbabs

send completion signal to controller, CC

<> as above

set_free_dio_reg

disks

find entry for a request

find_dio_reg

disks

get index of dio entry

RP_shutdown

recproc

shutdown process

finishsr

sndrcv

finish send, receive

COMMON FUNCTIONS


FIND_RP_ri

findrp

get ptr to request info structure


APPENDIX B.


RECORD PROCESSING PSEUDO-CODE


This documentation is a mid-level presentation of events occurring
within the RECP process. The intent is to provide the user with a
basic understanding of the activity that occurs during specific events.
It does not represent the exact steps taken within a function.


External Variables


struct ti_info        dio_reg[MAX_DIO_REG]
struct RP_rid_info    *front_RP_rid_info
struct RP_rid_info    *rear_RP_rid_info
char                  *TB


Pseudo-code


Initialize process (RecP_init in recpi.c)
    Initialize communication channels (initsr in sndrcv.c)
    Initialize variables related to disks (disk_init in disks.c)
        Set up the track buffers for each region used by disk I/O
        Set dio_reg[DIO_REG].ti_reg_status = REG_FREE (not being used)
    Set StopSys = FALSE

Enter message receiving loop; continue while StopSys = FALSE
    Get the next message (Msg$RP_R in recpsr.c)
        Check if any request is waiting for a region (chk_waiting_req in chkwait.c)
            Traverse linked list of struct RP_rid_info's to check whether any has
            RP_ri_status of WAITING
            If a request is waiting for a region
                Put traffic id and request number into message buffer
                Fill message header with sender and receiver equal to RECP, and type
                equal to OLD_REQ
                Return
        Else if no request is waiting for a region
            Check to see if there is a new message (receive)
                Wait flag is TRUE
            If there is a message return
            Wait for a message or an I/O completion (wait_msg in waitmsg.c)
            [Can this function be reached?]
    Get the sender name of the message (Sender$RP_R)
    Switch on message sender

I**#########*#######*######################################################### 

case DM (RP_DM)
    Get the type of the message (Type$RP_R in recpsr.c)
    Switch on message type
    case ReqDiskAddrs (ReqProcessing in recproc.c)
        Get the request (ReqAddrs$RP_R in recpsr.c)
            Copy the database id into dbid[]
            Copy the request into the request table (request->req_tbl)
            Copy number of addresses into addrs->as_no_addrs
            Copy each disk, cylinder, track no set into addrs->as_addrs[n]
            Copy new track flag to NewTrack
        Copy traffic id and request number from request table into struct ReqId
        If INSERT set tmpl_index = 7 else set tmpl_index = 8
        Get ptr (tmpl_ptr) to struct rtemp_definition
        Get ptr (RP_rb) to a result buffer structure (RB$GET in rbabs.c)
            Copy traffic id and request number from rid into request buffer
            Set RP_next_empty_pos = 0


Get ptr (RP_ri_ptr) to struct RP_rid_info (the main struct for the process)
(All  ?T‘*'_P.F_  r  i  in 

If RETRIEVE-COMMON

If not RETRIEVE-COMMON

Allocate space for the new RP_rid_info

Link to list of RP_rid_info; set front_RP_rid_info and rear_RP_rid_info
Copy traffic id and request number from rid into RP_ri_rid
t '  NULL  (F.F  r  i  has;,,  F.F  by  hash,  F.F  aag  ptr) 

;o»  Pr**|['*1fi<"  —  FALSE 

Copy the database id from dbid[] into RP_ri_dbid
t  request  int  o  F.F  ri  dl:  :  d 

Copy  address  set  (disk,  cylinder ,  track)  into  RF_ri_dbid 
If RETRIEVE

If not RETRIEVE

Set ptr in RP_ri_dbid to aggregate_info to NULL
S  *  - 1  a  lit  :.  ■  f  t.  !.-•  index  t  .  be  real  (ct-id:  ini'  L_  r’ 

Link rtemp_definition to RP_rid_info

Link ResultBuffer to RP_rid_info

Fill  M  _ri_ur-pt  i  )  in  RF_r  id_ir.f  wi»  h  •  '  r 

If not UPDATE caused by INSERT

Set RP_ri_status in RP_rid_info to NOT_WAITING
If UPDATE caused by INSERT


Vf dr o  r  st I haseKa it  in j> 


(NewTrack == FALSE)

If inserting a record into a new track (NewTrack == TRUE)
    Look for a free region (get_free_dio_reg in disks.c)
        Find 1st entry in global dio_reg array with ti_reg_status == REG_FREE
        If found set its ti_reg_status = REG_IN_USE
    If free region found
        Put information in the region (put_info_dio_reg in disks.c)
            Fill in traffic id and request number
            Fill in disk, cylinder, and track numbers
        Find entry and map to the region (map_dio_reg in disks.c)
            Find the entry for a request (find_dio_reg in disks.c)
                Match request and storage info to dio_reg elements until found
                Return index to entry (ind_dio_reg)
            Map to the region (map_TB in unixdisks.c)
                Set track buffer (TB) to tb entry corresponding to tb_info entry
        Set the beginning of each record-sized division to no_rec ('?')
        Set the end of the buffer to EOTrack ('(.')
        Issue the write ($INS_PROCESSING in insp.c)
            Get ptr (RP_ri_ptr) to the RP_rid_info entry (FIND_RP_ri)
            Insert the record into track buffer ($TB_INSERT_RECORD in insp.c)
                Scan track buffer to find the first free slot to insert the record
                If found
                    Set first byte to rec_exist ('1')
                    Set ptr (ptr) to next byte
                    For each attribute
                        Write value followed by EOField ('$')
                    Fill in EORecord ('#')
                    Record will be, for example: 1value1$value2$value3$#
            Unmap from the region (umap_dio_reg in disks.c)
                Free the TB so it does not point anywhere (umap_TB in unixdisks.c)
                    Set TB to NULL


S*-‘  FF  i  i  n.  cemj.  let  eJ  w:  it  es  -  ( 

S  e  * .  tliis  BE  t  •.  ins  ctunt  =  c 
;  n  :  _m:  r  e  ger._inr  mr  c_r  :*•  FALSE 
If  ''II'  AT  E  cause  1  by  INSERT  (F.F_ri_?*a*  us  = 

F *•  ‘  Ml  r. 

’  “  !  t  VI  e  fi  cm  teqtl  1 
If INSERT (ST_Insert in stins.c)

If inserting a record into an old track


I  f 


Store TRACK BUFFER back to the disk according to addr (TB_STORE in disks.c)
    Find the entry for a request (find_dio_reg in disks.c) (as above)
    Find entry; map to the region (map_dio_reg in disks.c) (as above)
        TB points to the region
    Send the info to DISK I/O (Dio$RP_S in recpsr.c)
        Send request identifiers and contents of track
    Set the ti_reg_status for the region to REG_WRITE
    Unmap from the region (umap_dio_reg in disks.c) (as above)

If free region not found
    Set RP_ri_status to WAITING

RETRIEVE, RETRIEVE-COMMON, DELETE (ST_RetDel in stretdel.c)


If UPDATE (ST_Update in stupd.c)


•  +  +  + 


+  +  +  4  4  +  +  ■ 

case  ChangeiolusF.es  (Changed  ClusF.es  in  cha.ued.c) 

case NoMoreGenIns (NoMoreGenIns in nomore.c)

r'  55  ft  f  <r>  t  •  * 

....  t  ,  .  |  .  R.  i 

- ■  a c -  }■  r.  ■  f  tpv  _pq  » 

Message from 'self'; a backlogged request is processed; no actual message is
received

Get the type of the message (Type$RP_R in recpsr.c)
Switch on message type


case OLD_REQ (ReqProcessing in recproc.c)

ttOOl.tlMMIoillMIlUiKOIIKHOilUIKMttttlttttlXXXtxtixt.c.to 

•••I.oe  p'Lf  ( p.F  _CNTL_ANt'THEF._BE_MS;’  in  re.qioc.cl 
Get the type of the message (Type$RP_R in recpsr.c)
Switch on message type

Common messages

case RetComNotification


case BucketInfo

I  t  4  4  4  14  4  i  444444+-4  +  444  +  44  +  +4+44+4444+  +  4. 

~ase  Rtr.| 

case DIO (RP_Dio)
    Get the type of the message (Type$RP_R in recpsr.c)
    Switch on message type
    case PIO_WRITE (RP_WriteCompleted in recproc.c)

    +++++++++

    case PIO_READ
        Restore data from message buffer to track buffer (ResData$RP_R in recpsr.c)
        A physical read is completed (RP_ReadCompleted in recproc.c)

#############################################################################

Shutdown process (RP_shutdown in recproc.c)
    Finish send receive (finishsr in sndrcv.c)
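
The new-track INSERT path in the pseudo-code above (find a free region, then write the record into the first free slot of the track buffer) can be sketched in Python. This is only an illustration under stated assumptions, not the RECP source: the function names, the list-of-strings track buffer, and the space padding of slots are hypothetical. The marker characters are taken from the pseudo-code as it survives in this copy.

```python
# Marker characters as they appear in the pseudo-code above.
NO_REC = "?"      # first byte of an empty record-sized slot
REC_EXIST = "1"   # first byte of an occupied slot
EO_FIELD = "$"    # written after each attribute value
EO_RECORD = "#"   # written after the last field

REG_FREE, REG_IN_USE = "REG_FREE", "REG_IN_USE"


def get_free_region(status):
    """Mark the first REG_FREE region as REG_IN_USE and return its index.

    Returns None when no region is free (the request would then wait).
    """
    for i, state in enumerate(status):
        if state == REG_FREE:
            status[i] = REG_IN_USE
            return i
    return None


def new_track_buffer(num_slots, slot_size):
    """Build an empty track buffer; space padding is an assumption."""
    return [NO_REC + " " * (slot_size - 1) for _ in range(num_slots)]


def encode_record(values):
    """Lay out one record: rec_exist, then value + EOField each, then EORecord."""
    return REC_EXIST + "".join(str(v) + EO_FIELD for v in values) + EO_RECORD


def insert_record(track_buffer, values, slot_size):
    """Write the record into the first free slot; return its index or None."""
    record = encode_record(values)
    if len(record) > slot_size:
        raise ValueError("record does not fit in a slot")
    for i, slot in enumerate(track_buffer):
        if slot[0] == NO_REC:
            track_buffer[i] = record.ljust(slot_size)
            return i
    return None
```

For instance, `encode_record(["value1", "value2", "value3"])` produces the layout given as the example record in the pseudo-code. The end-of-track marker is left out of the sketch because its character is not legible in this copy.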


APPENDIX  C.  TRANSACTION  DOCUMENTATION 


This documentation is a low-level presentation of the specific events occurring within the
PUTHASHBUFFER function of RECP. It provides the function's name, a short description
of the variables passed in, and a logical flow of events.


Function Name: PUTHASHBUFFER

The following variables are passed in:

1. hi_ptr: This variable points at the function hashinfo. The function hashinfo stores the
intermediate results of a retrieve-common.

2. bucket: This is the virtual storage address; the bucket number.

3. attr_value: This is the specific attribute value of the query.

4. record: This is the contents of the result buffer after the attribute name and the attribute
value have been extracted.

5. last_record: This flag indicates whether a particular record is the last from a specific
backend.


[Figure: PUTHASHBUFFER flowchart. Surviving annotations:]

This loop mechanism counts the number of characters in the record and assigns it to
r_index for later use.

To a test to see if the buffer is too full for the record.

Arriving from the process that controlled storage of the bucket number.

Arriving from a test to determine if this is the last record. If it is, we send the contents
of the HASHBUFFER to be stored.

CALL BUCKET BLOCK: PASS CONTENTS OF THE HASHBUFFER AS INDICATED EARLIER.

When the test determines the last record has not been sent, it just returns to the calling
function.
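
The flow described by these annotations can be sketched as follows. This is a hedged reading of the flowchart, not the thesis code: `HashInfo`, `bucket_block`, `put_hash_buffer`, the `capacity` field, and the per-bucket string buffers are hypothetical names and simplifications. `attr_value` is accepted only to mirror the documented parameter list and is not used.

```python
class HashInfo:
    """Hypothetical stand-in for hashinfo: holds intermediate results."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed HASHBUFFER size, in characters
        self.buffers = {}          # bucket number -> HASHBUFFER being filled
        self.blocks = {}           # bucket number -> list of stored buffers


def bucket_block(hi, bucket):
    """Store the current HASHBUFFER contents for the bucket, then empty it."""
    contents = hi.buffers.get(bucket, "")
    if contents:
        hi.blocks.setdefault(bucket, []).append(contents)
        hi.buffers[bucket] = ""


def put_hash_buffer(hi, bucket, attr_value, record, last_record):
    # Loop that counts the characters in the record (r_index in the text).
    r_index = 0
    for _ in record:
        r_index += 1

    # Test whether the buffer is too full for the record; if so, store it first.
    if len(hi.buffers.get(bucket, "")) + r_index > hi.capacity:
        bucket_block(hi, bucket)

    hi.buffers[bucket] = hi.buffers.get(bucket, "") + record

    # On the last record, send the buffer contents to be stored;
    # otherwise simply return to the caller.
    if last_record:
        bucket_block(hi, bucket)
```

A record longer than `capacity` is stored as its own oversized block in this sketch; how the original handles that case is not stated in the surviving text.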


APPENDIX D. GUIDE TO MESSAGE ENTRIES


A. MESSAGE FORMAT INFORMATION

This appendix contains the format of all messages utilized on MBDS. Additionally, an
example of the format of a BucketInfo message is provided. The message format that is
used within MBDS is illustrated below:


Type: (message type): This is represented by a 3 digit number.

Sender: (sending process(es)): This is represented by a 3 digit number.

Receiver: (receiving process(es)): This is represented by a 3 digit number.

One special note: if a Put is the receiver, the message is relayed to the
Get in another machine. The ultimate receiver of the messages is
indicated.


A BucketInfo message is presented below to illustrate the placement of the above
format information.


..1..  ..3.. 

501504252XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX& 


1: Sender       = RECP
2: Receiver     = P_PCLB (all other backends)
3: Message type = BUCKET INFO
4: Message body = The message body will contain target records of a
                  retrieve-common
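
A message laid out this way can be split into its parts with a few lines of Python. The sketch rests on assumptions: the three 3-digit fields are taken in the order of the numbered legend (sender, receiver, message type), and the trailing '&' in the example is treated as an end-of-message marker. The names `Message` and `parse_message` are hypothetical and not part of MBDS.

```python
from typing import NamedTuple

FIELD_WIDTH = 3        # each header field is a 3-digit number
END_OF_MESSAGE = "&"   # assumption: '&' closes the message, as in the example


class Message(NamedTuple):
    sender: str
    receiver: str
    msg_type: str
    body: str


def parse_message(raw):
    """Split a raw message into its three header fields and its body."""
    header_len = 3 * FIELD_WIDTH
    if len(raw) < header_len + 1 or not raw.endswith(END_OF_MESSAGE):
        raise ValueError("message too short or missing terminator")
    header = raw[:header_len]
    if not header.isdigit():
        raise ValueError("header fields must be numeric")
    # Field order follows the numbered legend above (an assumption).
    fields = [header[i:i + FIELD_WIDTH] for i in range(0, header_len, FIELD_WIDTH)]
    return Message(fields[0], fields[1], fields[2], raw[header_len:-1])
```

Applied to a shortened form of the example, `parse_message("501504252XXXX&")` yields the fields `501`, `504`, `252` and the body `XXXX`.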